Team LiB   Previous Section   Next Section

11.2 Monitoring the Databases

Another very high priority for Oracle DBAs is the monitoring of database connectivity. You need to constantly poll all of the databases at your site to make sure they are up and running. If a database is unavailable for any reason, you must be proactive and check out the problem as soon as possible. It is definitely not good practice to wait until your users discover a connectivity problem before you take care of it. It's far better to deal with such problems as soon as they occur in order to minimize the impact on users. Much as we dislike the dissonant refrains of a pager in the small hours, we still prefer that to having users inform us of database problems as we stroll innocently into the office the next morning! As databases grow more critical to business, the importance of uptime continues to grow too. It's clear that we need a tool to help us maintain 24x7 connectivity, or at least edge us closer to this ideal state.

We've included a collection of database monitoring scripts in our PDBA Toolkit that will help you keep track of database connectivity and alert you when something goes wrong. The scripts described in the following sections continually poll databases to make sure they're up and running. When a database is unavailable, the on-call DBA is immediately emailed and paged, and every member of the DBA team also receives an email. Using our scripts, you can configure the emailing and timing to meet your specific needs.

11.2.1 Monitoring Database Connectivity with dbup.pl and dbpu_NT.pl

We've written a pair of connectivity monitoring scripts: dbup.pl for Unix and dbup_NT.pl for Win32. Both are highly configurable and offer these features:

Database uptime

You specify the time periods that a specific database should be up. If the database goes down outside these hours, the DBA gets emailed but is not paged.

DBA rotation

You can create an on-call rotation for DBAs, and you can specify a default DBA to cover any exposed gaps you may choose to leave. We'll show a sample DBA schedule later.

Supervisor notification

You may optionally page a supervisor (or anyone else) when a database fails, regardless of the DBA that is on call. You can also email the supervisor.

Connection test interval

You can configure the intervals between database connection attempts.

Delayed paging

You can also delay paging during off-duty hours. The configuration parameter hoursToPageImmediate determines the hour range when paging is suspended. The parameter maxConnectRetries determines the number of connection attempts before the on-call DBA is paged. This prevents the DBA or supervisor from being paged during the night if database connectivity is restored within a preset number of reconnection attempts. This is useful for situations that disrupt database connectivity for a few minutes without causing a true problem — for example, evening reboots that occur on a standard schedule. Email and event logging still take place.

Impressing your system administrators

OK, this is hardly built into the system, but it does tend to happen anyway. When one of your database servers decides to head for the land of dreams, you're naturally going to get paged when the connectivity monitor cannot connect to the database server. If you configure dbup.pl to connect to each database every five minutes, it won't be long before you're aware that the database server itself is down. Sysadmins are always somewhat amazed to find out that you know about a server outage before they do. We've found this goes a long way towards maintaining cordial relations with our own systadmins.

11.2.2 Installing Additional Modules

In order to properly manipulate dates in our dbup.pl monitoring script, we've chosen to use Sullivan Beck's Date::Manip module, an incredibly flexible pure Perl module used for parsing and comparing dates. It allows you to specify a date in literally hundreds of formats, recognizing them automagically and parsing them into its own internal operational format.[5] This module allows dbup.pl to handle the time intervals you specify.

[5] Appendix D, also describes Date::Calc, a more rapid, less diverse, C-based date-manipulation module.

Because the Data::Manip module is not part of the regular Perl distribution, you will need to install it before continuing with the configuration of dbup.pl.

11.2.2.1 Installing Date::Manip on Unix

Download the latest version of Date::Manip from a CPAN site near you:

http://www.cpan.org/authors/id/SBECK

Follow the usual drill to unpack its tarball and install it, or alternatively use the CPAN shell as follows:

$ perl -MCPAN -e "shell"
cpan> install Date::Manip
cpan> quit
11.2.2.2 Installing Date::Manip on Win32

Connect to the Internet and perform the usual ActivePerl maneuver:

C:> ppm
PPM> install Date::Manip
PPM> quit

Up until now, most of the Win32 Perl modules we've installed via PPM have been fairly simple to install. Date::Manip is a bit more complex. We don't have to compile anything, but we do need to do some editing, as we'll describe in the next section.

11.2.2.3 TZ — Time Zones

The Date::Manip module needs to be able to get time zone information from the machine it's running on. This is not a problem on Unix platforms, which give up time zone data quite readily. However, obtaining this information on Win32 is a little more involved. The Date::Manip documentation describes a number of methods for setting the attributes of the time zone environment variable, TZ, but we've found only one reliable way on Win32 platforms, and that's to use the Date::Manip configuration file, Manip.cnf. This file is not normally configured when you install the Date::Manip module with PPM. Follow these steps to obtain and edit this file:

  1. Copy a Date::Manip configuration file to PDBA_HOME. There are two ways to obtain this file. The first and easiest way is to copy it from the PDBA Toolkit distribution as we've already included it for you there:

    C:> copy C:\perl\site\lib\PDBA\conf\Manip.conf C:\pdba
  2. The other way is to download the Date::Manip tarball from CPAN. The current version of Date::Manip, as of this writing, was 5.40. If the version installed on your system is different, you may wish to install the configuration file from the latest version of Date::Manip, which can be found at http://www.cpan.org/authors/id/SBECK. Open the archive file, locate the file Manip.cnf, and save it to PDBA_HOME as shown in Figure 11-4.

    Figure 11-4. Extracting Manip.cnf from the archive
    figs/pdba_1104.gif
  3. Rename the file from Manip.cnf to Manip.conf. (This is simply to stay consistent when naming configuration files and thus avoiding later confusion.) You can rename the file by right-clicking on the file in Windows Explorer, and clicking on Rename. Alternatively, do this from the command line:

    C:> move C:\pdba\Manip.cnf C:\pdba\Manip.conf
  4. Next up sports fans, edit the file:

    C:> vi C:\pdba\Manip.conf

    There are two entries for TZ that you may need to edit: TZ and ConvTZ. The following example shows where they're usually located in the Manip.conf file. If your time zone[6] happens to be PST (Pacific Standard Time), you can save the file the way it came out of the box (which is nice, if you live in Oregon, USA, but less nice if you live in Oxfordshire, England). If you're not sure of your exact time zone, you can check the full documentation for yourself via perldoc Date::Manip, on both Unix and Win32. Once you do have the time zone data set correctly, save the file and close the editor.

    [6] You can learn more about the time zones defined in Date::Manip by reading section 5 of RFC 822 on the "Standard for ARPA Internet Text Messages" at http://www.faqs.org/rfcs/rfc822.html

    ################################ CONFIG VARIABLES###############################
    # See Date::Manip man page for a description of all config variables.
    EraseHolidays    =
    PersonalCnf      = Manip.cnf
    PersonalCnfPath  = t:.
    Language         = English
    DateFormat       = US
    TZ               = PST
    ConvTZ           = PST
    Internal         = 0
    FirstDay         = 1
    WorkWeekBeg      = 1
    WorkWeekEnd      = 5
    ...
  5. Date::Manip must be told where to find the configuration file. To do this, you'll need to edit the actual Manip.pm module file. Assuming that our Perl installation is under C:\Perl, this will be found as C:\Perl\lib\Date\Manip.pm.

  6. You won't be able to edit this file with Notepad.exe because the file lines end with an <LF> Unix-style line terminator instead of the expected MS-DOS <CR><LF> combination. Oh dear, you'll have to use vi!

  7. OK, we admit that you can use Wordpad.exe, as it's more resilient, but vi is still best. Wordpad.exe can be accessed from the Windows Start button via Start figs/U2192.gif Run, Wordpad. Open the file with File figs/U2192.gif Open and navigate to the C:\Perl\lib\Date\ directory before opening Manip.pm.

  8. We just need to edit the global configuration file location for Date::Manip. The appropriate directive can be found near the top of the Manip.pm file by searching for GlobalCnf. We've highlighted this line in the following example and set the value to c:/pdba/Manip.conf. Your own value may differ.

    ###########################################################################
    # CUSTOMIZATION
    ##########################################################################
    #
    # See the following POD documentation section CUSTOMIZING DATE::MANIP
    # for a complete description of each of these variables.
      
    # Location of the global config file.  Tilde (~) expansions are allowed.
    # This should be set in Date_Init arguments.
    $Cnf{"GlobalCnf"}="c:/pdba/Manip.conf";
    $Cnf{"IgnoreGlobalCnf"}="";
    ...
  9. We now need to save the file, and be sure to save it as a text file. Wordpad.exe has the disconcerting habit of adding a .txt file extension to any file when you do this, but save it that way anyway.

  10. We have to ensure that the file name is correct. The Manip.pm file has a read-only attribute, so change this before renaming the file. Open a command window and adapt the following commands, depending on what your edited file is called:

    C:> attrib -r C:\perl\site\lib\Date\Manip.pm
    C:> del C:\perl\site\lib\Date\Manip.pm
    C:> move C:\perl\site\lib\Date\Manip.pm.txt C:\perl\site\lib\Date\Manip.pm 
    C:> perl -cw C:\perl\site\lib\Date\Manip.pm

    Notice our use of the perl -cw switch. The -c switch causes Perl to check the syntax of a script without actually executing it, and the -w switch enables warnings that will catch many common errors. Using these switches ensures that the changes we made to the file are syntactically correct. Checking regularly when making changes to Perl files can save many headaches later down the road.

The Date::Manip Win32 configuration is now complete.

11.2.3 Using the PDBA::OPT and PDBA::PWC Modules

When we installed the PDBA Toolkit, it included a number of background modules, all of which are described in Chapter 9. These modules provide code that is used to support the operations of many of the Perl scripts in the toolkit. Two modules that are particularly important to the connectivity monitoring discussed in the following sections are the PDBA::OPT and PDBA::PWC modules. The purpose of the PDBA::OPT module is to scan the command line for options that may be intended for the password server; it then feeds the security information found on the command line to the PDBA::PWC module, which retrieves a password.

In the following sections, we'll delve into the guts of the toolkit in order to explain why you need PDBA::OPT and PDBA::PWC and how they work. Most readers won't need to know this information. But if you are interested in how we've put the toolkit together and may want to extend it some day, read on.

11.2.3.1 The password server

You may recall that back in Chapter 9 we set up the password server and experimented with the password client pwc.pl, which makes use of the PDBA::PWC module. This script can be used on the command line to retrieve passwords from the server, and the same interface is used in most of our toolkit scripts. Fortunately, the pwc.conf configuration file allows us to make use of the password server while keeping the number of command-line options to a minimum.[7] The password server is required if you wish to use the dbup.pl connectivity monitor. It's the PWD::OPT module that works behind the scenes to allow us to eliminate a great deal of related code. Here's how it works:

[7] The only options normally needed at the command line in order to use the password server are -machine, -database, and -username, which we'll cover soon.

  1. Getopt::Long is a standard Perl module used for parsing command line arguments. We need to configure the Getopt::Long module to allow extra command-line options, which it doesn't recognize by default. Here's an excerpt from dbup.pl with the new Getopt::Long configuration clearly shown:

    use Getopt::Long;
    our %optctl=(  );
      
    # passthrough allows additional command line options
    # to be passed to PDBA::OPT if needed
    Getopt::Long::Configure(qw{pass_through});
    GetOptions( \%optctl, "conf=s",   "debug!",
                          "kill!",    "mail!",
                          "daemon!",  "h|z|help" );

    The Getopt::Long::Configure(qw{pass_through}) statement tells Getopt::Long to ignore any extra command-line arguments it may soon see via @ARGV. These will be passed onto PDBA::OPT.

  2. The dbup.pl script loops through a hash data structure, %dbup::uptime, which lists the databases to which dbup.pl should attempt to connect. For each target database, it calls the PDBA::OPT module. This module, in turn, uses pwc.conf to help connect to the password server. Here's a simplified example of the call to PDBA::OPT:

    for my $db ( keys %dbup::uptime ) {
       my $password = 
          PDBA::OPT->pwcOptions ( INSTANCE => $db,
                                  MACHINE => $dbup::uptime{$db}->{machine},
                                  USERNAME => $dbup::uptime{$db}->{username});
    }

    What isn't apparent in this short example is the volume of hard work that PDBA::OPT is doing in the background. It looks up the pwc.conf configuration file, loads its entire contents, and then makes sure that any parameters input via the command line take precedence over those just found in pwc.conf. With the required information, PDBA::OPT then retrieves the necessary password from the password server.

We've reproduced the working portion of PDBA::OPT in Example 11-3. Configuration is unnecessary because it uses the password client setup we created in Chapter 9. It's important, though, to understand how PDBA::OPT makes use of command-line parameters to override values found in the pwc.conf file. We'll work through the important code lines after Example 11-3.

Example 11-3. PDBA::OPT
  1      package PDBA::OPT;
  2          
  3      $VERSION = '1.00';
  4          
  5      use strict;
  6      no strict 'vars';
  7          
  8      use Getopt::Long;
  9      use PDBA::ConfigFile;
 10     use Carp;
 11     %optctl = (  );
 12         
 13     sub pwcOptions {
 14         
 15        my $self = shift;
 16        my %args = @_;
 17         
 18        Getopt::Long::Configure(qw{pass_through});
 19         
 20        use PDBA::PWC;
 21         
 22        $optctl{pwc_conf} = 'pwc.conf';
 23         
 24        # specified directly on the command line
 25        GetOptions(\%optctl,
 26           "pwc_host=s",     # remote password server host
 27           "pwc_port=i",     # port to connect to
 28           "pwc_machine=s",  # database server
 29           "pwc_instance=s", # database instance 
 30           "pwc_username=s", # database username
 31           "pwc_conf=s",     # configuration file
 32           "pwc_key=s",         # encryption key
 33           "pwc_my_username=s", # your password server username
 34           "pwc_my_password=s", # your password server password
 35           "pwc_debug!"         # turn debug on
 36        );
 37         
 38        # overrides from the config file
 39        if ( exists( $optctl{pwc_conf} ) ) {
 40           use PDBA::ConfigFile;
 41           unless ( new PDBA::ConfigLoad( FILE => $optctl{pwc_conf} ) ) {
 42              croak "could not load config file  $optctl{pwc_conf}\n";
 43           }
 44         
 45           for my $key ( keys %pwc::optctl ) {
 46              $optctl{'pwc_' . $key} = $pwc::optctl{$key} 
 47                 unless exists $optctl{'pwc_' . $key};
 48           }
 49        }
 50         
 51        # overrides from args passed to pwcOptions
 52        # just a bunch of ifs
 53        if (defined($args{HOST})){ $optctl{pwc_host} = $args{HOST} }
 54        if (defined($args{PORT})){ $optctl{pwc_port} = $args{PORT} }
 55        if (defined($args{MACHINE}))
 56           { $optctl{pwc_machine} = $args{MACHINE} }
 57        if (defined($args{INSTANCE}))
 58           { $optctl{pwc_instance} = $args{INSTANCE} }
 59        if (defined($args{USERNAME}))
 60           { $optctl{pwc_username} = $args{USERNAME} }
 61        if (defined($args{CONF})){ $optctl{pwc_conf} = $args{CONF} }
 62        if (defined($args{KEY})){ $optctl{pwc_key} = $args{KEY} }
 63        if (defined($args{PWD_USERNAME}))
 64           { $optctl{pwc_my_username} = $args{PWD_USERNAME} }
 65        if (defined($args{PWD_PASSWORD}))
 66           { $optctl{pwc_my_password} = $args{PWD_PASSWORD} }
 67        if (defined($args{DEBUG})){ $optctl{pwc_debug} = $args{DEBUG} }
 68         
 69        if (
 70           ! defined( $optctl{pwc_host})
 71           || ! defined( $optctl{pwc_port})
 72           || ! defined( $optctl{pwc_machine})
 73           || ! defined( $optctl{pwc_instance})
 74           || ! defined( $optctl{pwc_username})
 75           || ! defined( $optctl{pwc_key})
 76           || ! defined( $optctl{pwc_my_username})
 77           || ! defined( $optctl{pwc_my_password})
 78        ) {
 79        croak qq/usage: $0 with PDBA::OPT
 80       --pwc_host <password server> 
 81       --pwc_port <tcp port>
 82       --pwc_machine <database server>
 83       --pwc_instance <database instance> 
 84       --pwc_username <database username> 
 85       --pwc_conf <configuration file - optional but recommended >
 86       --pwc_key <encryption key>
 87       --pwc_my_username <password server username>
 88       --pwc_my_password <password server password
 89     /;
 90        }
 91         
 92        my $remote_host=$optctl{pwc_host};
 93        my $remote_port=$optctl{pwc_port};
 94        my $machine=$optctl{pwc_machine};
 95        my $instance=$optctl{pwc_instance};
 96        my $username=$optctl{pwc_username};
 97        my $myusername=$optctl{pwc_my_username};
 98        my $mypassword=$optctl{pwc_my_password};
 99        my $key=$optctl{pwc_key};
100         
101        $optctl{pwc_debug} = 
102           exists $optctl{pwc_debug} ? $optctl{pwc_debug} : 0;
103         
104        my $client = new PDBA::PWC(
105           host => $remote_host,
106           port => $remote_port
107        );
108         
109        $client->authenticate(
110           username => $myusername,
111           password => $mypassword,
112           key => $key,
113           debug => $optctl{pwc_debug}
114        );
115         
116        # get response
117        my $password = $client->getPassword(
118           machine => $machine,
119           instance => $instance,
120           username => $username,
121           key => $key, 
122           debug => $optctl{pwc_debug} 
123        );
124         
125        return $password;
126     };
127     1;
11.2.3.2 Configuring Getopt::Long for pass-through mode

Let's look at the code and see what's going on here. In the example, at line 18, you'll see Getopt::Long configured into pass-through mode. This prevents it from complaining about unrecognized @ARGV options. Line 25 is the GetOptions call used to retrieve additional arguments from the command line, which may be used to override parameters in pwc.conf.

Let's consider an example to see how this works. The script my_script.pl relies on PDBA::PWC to retrieve passwords from the password server so a database connection can be made. A typical call to the script might look like this:

$ my_script.pl -machine sherlock -database ts01 -username scott

The script relies on the password server on Unix server watson to retrieve the password for user scott on database ts01. The ts01 database is itself housed on Unix server sherlock. The password server running on watson is normally transparent to the user. This is because the nitty-gritty details are hidden away in the pwc.conf configuration file, and PDBA::OPT is taking care of all of that for you.

What happens though, if the server watson is inaccessible for some reason? We still need to run our script, but because the password server is unavailable, we'll see an error like the code snippet here:

Uncaught exception from user code:
        Couldn't connect to watson:1579 : IO::Socket::INET: Timeout
...
        PDBA::OPT::pwcOptions('PDBA::OPT', 'INSTANCE', 'ts01', 'MACHINE',
'sherlock', 'USERNAME', 'scott') called at ./my_script.pl line 39

Further investigation reveals that the server itself is down and won't be up for another two hours. But we do know that an identically configured password server is also running on server mycroft as shown in Figure 11-5. Because PDBA::OPT allows us to override the parameters in pwc.conf with those stipulated on the command line, you rerun the command to execute my_script.pl, this time redirecting PDBA::OPT to connect to the password server on mycroft:

$ my_script.pl -machine sherlock -database ts01 
      -username scott -pwc_host mycroft

This successfully outputs:

GLOBAL_NAME: TS01.JKS.COM

The script my_script.pl is a simple one; all it does is retrieve the GLOBAL_NAME of the database from the system view GLOBAL_NAME. The important point, however, is that it succeeded in doing so.

Figure 11-5. Use of an alternative password server
figs/pdba_1105.gif
11.2.3.3 Loading the password client parameters

At line 41, in Example 11-3, the parameters from pwc.conf are loaded. These are used to load the %optctl hash with keys that begin with pwc_, in lines 45-48. This code will only set parameters that have not already been set with command line parameters.

In lines 53-67, the hash reference $args is checked for explicit overrides passed to PDBA::OPT by the caller. If any values are found, they replace the corresponding keys in the %optctl hash. For example, the following code fragment always directs PDBA::OPT to try to connect via the password server on host moriarty, regardless of what is contained in pwc.conf or directed to by the command-line option -pwc_host.

my $password = PDBA::OPT->pwcOptions (
      INSTANCE => $db,
      MACHINE => $dbup::uptime{$db}->{machine},
      USERNAME => $dbup::uptime{$db}->{username},
      HOST => 'moriarty' );

In lines 92-127 these final values are used to set up a session with the password server and retrieve the requested password.

Although we've provided quite a bit of detail here, you'll be relieved to hear that most Oracle DBAs never need to worry about this level of detail. As we mentioned earlier, we've exposed this implementation information for those who might want to modify our scripts or create your own. (We'll say more about doing that in Chapter 13.) Table 11-4 contains a listing of all command-line overrides available to PDBA::OPT.

Table 11-4. Command-line overrides for PDBA::OPT

Option

Description

-pwc_host

Password server host

-pcw_port

TCP connection port

-pcw_machine

Database server

-pcw_instance

Database instance

-pcw_username

Database username needing retrieved password

-pcw_conf

Password client configuration file

-pcw_key

Password server encryption key

-pcw_my_username

Password client username

-pcw_my_password

Password client password

-pcw_debug

Turns on debug code

11.2.4 Configuring dbup.pl and dbup_NT.pl

After our skirmish with the black box forces of PDBA::OPT, let's get back to the task at hand. We need to periodically check our database connectivity and get notified immediately if there's a problem. Fortunately, we have the technology. We'll discuss the Unix and Win32 configurations together because although the two scripts differ internally, they are configured identically. However, there is a bit of preparation that is platform-specific.

11.2.4.1 Win32 preparation

Before configuring dbup_NT.pl on Win32, you need to do the following:

  1. First, place a copy of the supplied dbup.conf config file into PDBA_HOME:

    C:> copy C:\perl\site\lib\PDBA\conf\dbup_NT.conf C:\pdba
  2. Now we have to install the dbup service. As luck would have it, you should find a pre-supplied Perl script, dbup_service.pl, available for just this purpose. Navigate to the directory where the PDBA utilities are living and run the install script:

    C:> dbup_service.pl -install
    Install
    Successfully added.
    Finished.
11.2.4.2 Unix preparation

Change to the directory where the PDBA Toolkit was installed. Make sure that the configuration file exists in PDBA_HOME if you've changed Unix users and it's not already there:

$ cd /u01/build/PDBA-1.00
$ cp dbup2/dbup.conf $PDBA_HOME
11.2.4.3 Configuration on both platforms

The only real difference between the two systems is that dbup_NT.conf has had each line terminated with <CR><LF>, enabling us to edit this config file with Notepad.exe. The first section of the file is displayed in Example 11-4.

Example 11-4. dbup.conf
# this line is required
package dbup;
use PDBA;
  
use vars qw( $dateFormat $ignoreFile
             %parms %uptime %addresses 
             %onCallList @supervisors );
  
# format of date in ignoreFile
$dateFormat = "%Y/%m/%d-%H:%M:%S";
$ignoreFile = 'dbignore.conf';

Here's what's happening in this code:

  1. The first code line establishes the Perl package name as dbup.

  2. The use vars statement prevents runtime warnings from occurring whenever a variable is referenced only once.

  3. $dateFormat and $ignoreFile are used later on to notify the monitor whenever you wish to remove a specific database from these regular connectivity checks until a specified date and time.

The rest of the module is then loaded. Example 11-5 shows some operational parameters that we may need to edit.

Example 11-5. Operational parameters -- dbup.conf
%parms = (
   mailServer => 'mail.yourdomain.com',
   fromAddress => 'oracle@yourdomain.com',
  
   # how often to check database connectivity
   # measured in seconds
   # low value used for testing
   # connectInterval => 30,
   connectInterval => 300,
   # hours are 0 - 23
   # these are the hours to page immediately without
   # retrying the connection.  outside of these hours,
   # retry a configurable number of times before paging
  
   hoursToPageImmediate => [6..18],
   # use a limited range for testing during the day
   #hoursToPageImmediate => [19..20],
   # don't page DBA during lunch. :)
   #hoursToPageImmediate => [6..11, 13..18],
  
   # how many times to retry a connection when the time
   # is outside the range of hoursToPage
   maxConnectRetries => 3,
   # *everything* is logged
   # will use PDBA_HOME
   logfile => PDBA->pdbaHome(  ) . q{/logs/dbup.log},
);

The following summarizes these parameters; some of them may be familiar from our earlier discussion of alert monitoring:

mailServer

You need to change the default value of mail.yourdomain.com to your local mail server. If your domain is acme.com, this might be mail.acme.com. Talk to your friendly sysadmin about this if you're unsure.

fromAddress

Used in the sent from part of the mail header for all mail dispatched from dbup. This needn't necessarily be a real email address. Continuing with the domain of acme.com, you could set this to oracle_dba@acme.com.

connectInterval

Determines how much time, in seconds, passes between database connection attempts. The default of 300 allows five minutes between attempts, though on Unix you may wish to alter this value. On Win32, we set it to a lower value, such as 180, which is explained in the following note. Keep in mind that when a database is down, we'll be emailed, or paged whenever a connection attempt fails. The lower the value for connectInterval, the more often this will happen until the situation is remedied.

Win32 services are stopped and started via the Service Manager. This presents a problem with dbup_NT.pl if the connectInterval parameter is set too high. The monitor runs in a loop, and at the end of each iteration a sleep call is made, for a period of $dbup::parms::connectInterval seconds. The dbup service fails to respond to the Service Manager's request for termination until the sleep call is completed. If connectInterval is set to 600, the service will ignore termination requests for 10 minutes. We therefore recommend a lower value, such as 180.

The next two parameters work in concert, determining if the on-call DBA will be paged immediately or if the page call will be deferred:

hoursToPageImmediate

In Example 11-5, hoursToPageImmediate is set to a value of [6..18]. These numbers refer to a 24-hour clock. From 6:00 AM, until 6:59 PM (18:59), the DBA is paged immediately if a connection error occurs.

maxConnectRetries

If a connection error does occur outside of the hoursToPageImmediate time frame, dbup makes maxConnectRetries reconnect attempts before paging the DBA. We particularly appreciate this feature when servers are rebooted in the middle of the night after a minor hardware glitch, and paging is avoided because the target server and all its databases are back up within a few minutes. All failures-to-connect are still recorded in the log, however, and an email is sent regardless of whether the on-call DBA was paged.

Let's thank the two Larrys for small mercies. Because this is Perl, we're allowed enormous flexibility in defining the hoursToPageImmediate hourly range. You may have noticed a commented-out section in Example 11-5, where we'd previously specified the DBA paging hours as [6..11, 13..18]. If you feel that you deserve a two-hour lunch window between 11:00 AM and 1:00 PM, when the dbup secretary should place all calls on hold, the opportunity is there!

logfile

Full path file name for the log of all dbup operations. The default displayed in the example should work fine as is, unless you wish to place the log somewhere else. Because the config files are executed as code in the PDBA::ConfigLoad module, you can use the PDBA->pdbaHome method to place the file, as shown in Example 11-5.

11.2.4.4 Examining uptime requirements

Let's examine Example 11-6 to determine the database uptime requirements.

Example 11-6. Database uptime requirements -- dbup.conf
%uptime = (
   ts01 => {
      machine => 'sherlock',
      upDays => [0..6],
      upHours => [0..23],
      username => 'system',
      alertLevel => 3 },
   ts20 => {
      machine => 'mycroft',
      upDays => [0..6],
      upHours => [0..23],
      username => 'system',
      alertLevel => 3 },
   ts99 => {
      machine => 'watson',
      upDays => [1..5],
      upHours => [6..18],
      username => 'system',
      alertLevel => 2 },
);

The %uptime hash specifies five parameters for each database needing testing:

machine

Name of the server hosting the database.

upDays

This value range specifies the weekdays that the database is in live production mode. The values are 0..6, which means from Sunday through Saturday. Let's look at a few examples; see the following section for more details:

  • A value of [0..6] means that the database is required 7 days a week.

  • A value of [1..5] means that the database is required Monday through Friday.

  • A value of [1..3, 5,6] tells dbup that this database needs to be up Monday through Wednesday, and then Friday through Saturday.

upHours

Just as you can specify weekday requirements, you can also set the daily hours of required uptime. Let's look at a few examples; see the following section for more details:

  • A value of [0..23] indicates a database that's required 24 hours a day.

  • A value of [8..17] means that the database is only required to be up from 8:00 AM until 6:00 PM; this sets the time of operation from 8:00 AM to 5:59 PM (the granularity of this parameter is 1 hour, and at 6:00 PM the hour becomes 18:00). In practice we'd set these parameters to [6..18] (see the next section).

username

Database account used to attempt the database connection.

alertLevel

A strictly informational field. When the on-call DBA is paged, it includes a line detailing alertLevel. For example, we use alert levels of 1 through 3, with 3 being Ace-high. We'll see an email like this on our cell phones:

2002/12/29 - 03:45 DB Down: ts20
From: oracle@jks.com
To: dba@yourcompany.com
Failed to connect to database ts20 at 2002/12/29 - 03:45
 Alert Level: 3

If we see this one, we know that we'd better stop hugging that pillow, get out of bed, and take care of that database right away!

11.2.4.5 Looking at upDays and upHours parameters

Before moving on, let's take a moment to discuss how DBA paging is impacted by the upDays and upHours parameters. If dbup.pl or dbup_NT.pl is unable to connect to a database outside of the time specified by these parameters, the on-call DBA avoids being paged. This is useful for systems without 24x7 uptime requirements.

Let's see what the best parameter settings would be for this type of system. Given a database requiring uptime from Monday through Friday and from 8:00 AM until 6:00 PM each day, we'd set the upHours parameter to [6..18]. This specification has the effect of notifying the on-call DBA of any problems at 6:00 AM, allowing time to rectify those problems before 8:00 AM when users will typically expect the database to be available. This also allows an extra hour after the 6:00 PM end of uptime. At this point, the DBA can correct any errors that might have occurred during that dangerous time of day when machine operators are ending their shifts (and possibly rushing to go home and introducing errors in the process).

11.2.4.6 Setting up pager and email addresses

Example 11-7 shows the section of the configuration file that is used to set up the pager and email addresses for the DBA Manager, the Operations Manager, and the DBAs themselves.

Example 11-7. Email addresses — dbup.conf
%addresses = (
  
   'dbamgr' => {
      pager => '7775551212@mobile.att.net',
      emailWork => 'atwork@yourdomain.com',
      emailHome => 'athome@yourdomain.com' },
  
   'opsmgr' => {
      pager => '7775551212@mobile.att.net',
      emailWork => 'atwork@yourdomain.com',
      emailHome => 'athome@yourdomain.com' },
  
   'dba_1' => {
      pager => '7775551212@mobile.att.net',
      emailWork => 'atwork@yourdomain.com',
      emailHome => 'athome@yourdomain.com' },
   'dba_2' => {
      pager => '7775551212@mobile.att.net',
      emailWork => 'atwork@yourdomain.com',
      emailHome => 'athome@yourdomain.com' }
);
  
# these need to be the names of standard address
# entries in the %addresses hash.
# if these exist, supervisors will be paged at
# all times a DBA is paged, and will be sent
# all mail sent to the oncall DBA
# if you don't want this feature to be enabled,
# set it to an empty list.
# e.g. 
#   @supervisors=(  );
# or just comment it out.
@supervisors = ('opsmgr','dbamgr');

Note the following guidelines:

  • There should be a single entry for each of the dbamgr and opsmgr entries.

  • The DBA entries may be repeated for as many DBAs as you want to include.

  • If you don't want the Operations or DBA Managers to be paged in the event of database failure, just leave those entries as they are. We'll demonstrate shortly how to disable them.

Each entry has three parameters — one for a pager and two more for emails.

pager

Email address for a pager or cell phone. This must be a valid email address — simply using a phone number will not work, as all communications from dbup are done via email. Many cellular phone companies offer an email address for cellular phones, similar to those shown in Example 11-7.

emailWork

Valid email address for the nominated person's workplace.

emailHome

Valid email address for the nominated person's home email.

Be aware of the following rules:

  • The pager parameter is strictly unnecessary. If you wish to disable it, just set the value to an empty pair of quotes:

    pager => '',
  • The emailWork and emailHome parameters are required. Set them both to the same email address if necessary.

The last entry in the %addresses section is the @supervisors array. This determines which extra entries will be paged and emailed, in addition to the on-call DBA, in case of database failure. You disable this feature by commenting out the line with a hash # character or by creating an empty list:

@supervisors=(  );
11.2.4.7 Who's on Third?

The last thing we need to do before putting our connectivity monitor into action is to determine which DBA is on-call during any given week. Example 11-8 contains the start and the end of the default entries you'll find in the %onCallList hash, in the dbup configuration file. These abbreviated entries cover the time period from October 3, 2001, through October 2, 2002; each date is a Wednesday. If your schedule requires a different day as the switchover day of the on-call week, you need to edit these dates. Be sure to retain the YYYYMMDD format as shown.

Example 11-8. DBA on-call schedule — dbup.conf
%onCallList = (
   '20011003' => 'dba_1',
   '20011010' => 'dba_2',
   '20011017' => 'dba_1',
   '20011024' => 'dba_2',
   ...
   '20020911' => 'dba_2',
   '20020918' => 'dba_1',
   '20020925' => 'dba_2',
   '20021002' => 'dba_1',
   'default' => 'dba_1' );

Note the following about this example:

  1. Simply change our dba_1 and dba_2 values to those corresponding to the DBAs specified within the section shown in Example 11-7.

  2. The final entry in the %onCallList hash is default. This is the DBA (or manager) who is paged when dates fall outside the date range specified.

  3. There is some latitude possible when setting the %onCallList hash. If you use a single physical pager for the on-call DBA, and pass it around among those in the DBA group, you can delete all of the dated entries and leave just the default entry. This entry will thus always be the one paged in the case of a database failure. Even though only one DBA entry in the configuration file will now be paged, email will still be sent to all entries in the %addresses hash. Example 11-9 is an example of just such a setup.

Example 11-9. Configuring for a single pager — dbup.conf
%addresses = (
   'dbamgr' => { pager => '7775551212@mobile.att.net',
                 emailWork => 'atwork@yourdomain.com',
                 emailHome => 'athome@yourdomain.com' },
  
   'dba_1' => { pager => '7775551212@mobile.att.net',
                emailWork => 'atwork@yourdomain.com',
                emailHome => 'athome@yourdomain.com' },
  
   'dba_2' => { pager => '7775551212@mobile.att.net',
                emailWork => 'atwork@yourdomain.com',
                emailHome => 'athome@yourdomain.com' },
   
   'dbaPager' = > { pager => '7775551212@mobile.att.net',
                    emailWork => 'oracle@yourdomain.com',
                    emailHome => 'oracle@yourdomain.com' }
);
@supervisors = ('dbamgr');
%onCallList = ('default' => 'dbaPager');

Now that we're finished editing the dbup configuration file, we can check it for syntax. On Unix, specify:

$ perl -cw $PDBA_HOME/dbup.conf

On Win32 you'll need to specify the full path to the file:

C:\> perl -cw c:\pdba\dbup_NT.conf

If the syntax is valid, Perl responds with dbup.conf syntax OK (or a similar message). If invalid, you need to correct the error in your configuration file before continuing.

11.2.5 Running the Connectivity Monitor

We're now ready to run the fully loaded dbup monitor. For Unix, simply enter the following at the command line:

$ dbup.pl -daemon

For Win32, we'll need to start the Windows Service Manager application and then the Oracle_dbup_Monitor service, as shown earlier for chkalert_NT.pl.

11.2.5.1 Testing the monitor

Now we'll actually run the dbup monitor under varying conditions and examine the logfile output. The tests shown in this section were run under a Windows 2000 system. The configuration file we'll use is the one shown in Example 11-10. The only changes made for testing purposes were the substitution of real phone numbers and email addresses. Of the three databases to be checked, one of them, ts20, will be unavailable. We'll examine the log entries on a standard 24x7 schedule and then modify the entry for ts20 to ensure that the required uptime is Monday through Friday. We'll then look again at the log entries.

Example 11-10. Test configuration — dbup.conf
package dbup;
use PDBA;
           fromAddress => 'oracle@jks.com',
           connectionTimeout => 10,
           connectInterval => 120,
           hoursToPageImmediate => [0..23],
           maxConnectRetries => 3,
           logfile => PDBA->pdbaHome(  ) . q{/logs/dbup.log} );
  
%uptime = (
   ts01 => { machine => 'sherlock',
             upDays => [0..6],
             upHours => [0..23],
             username => 'system',
             alertLevel => 3 },
  
   ts20 => { machine => 'mycroft',
             upDays => [0..6],
             upHours => [0..23],
             username => 'system',
             alertLevel => 3 },
  
   ts99 => { machine => 'watson',
             upDays => [0..6],
             upHours => [0..23],
             username => 'system',
             alertLevel => 2 },
);
  
%addresses = (
   'dbamgr' => { pager => '7775551212@mobile.att.net',
                 emailWork => 'myboss@thecompanycom',
                 emailHome => 'myboss@herhome.com' },
  
   'jkstill' => { pager => '7775551213@mobile.att.net',
                  emailWork => 'jkstill@somewhere.com',
                  emailHome => 'jkstill@somewhere.com' },
  
   'andyd' => { pager => '7775551214@mobile.att.net',
                emailWork => 'andyd@somewhere.com',
                emailHome => 'andyd@somewhere.com' }
);
  
#@supervisors = ('opsmgr','dbamgr');
  
%onCallList = ( '20011107' => 'andyd',
                '20011114' => 'jkstill',
                '20011121' => 'andyd',
                '20011128' => 'jkstill',
                '20011205' => 'andyd',
                '20011212' => 'jkstill',
                '20011219' => 'andyd',
                '20011226' => 'jkstill',
                'default' => 'jkstill' );
1;

We let the Oracle_dbup_Monitor service run for about five minutes before stopping it. Example 11-11 displays the contents of the log file.

Example 11-11. dbup test #1
20011209163103:Service Starting - State is: 2
20011209163103:Service Started - State is: 4
20011209163103:Service running
20011209163103:Main Loop
20011209163104:Check database: ts01
20011209163105:Connection to ts01 successful
20011209163105:Check database: ts99
20011209163105:Connection to ts99 successful
20011209163105:Check database: ts20
20011209163106:Database ts20 down during required uptime
20011209163106:On call DBA is: andyd
20011209163106:Sent email to andyd@somewhere.com, jkstill@somewhere.com 
20011209163106:Database ts20 is down - paging DBA: andyd
20011209163107:Sent page to 7775551214@mobile.att.net
20011209163307:Service running
20011209163307:Main Loop
20011209163308:Check database: ts01
20011209163308:Connection to ts01 successful
20011209163308:Check database: ts99
20011209163308:Connection to ts99 successful
20011209163308:Check database: ts20
20011209163309:Database ts20 down during required uptime
20011209163309:On call DBA is: andyd
20011209163309:Sent email to andyd@somewhere.com, jkstill@somewhere.com 
20011209163309:Database ts20 is down - paging DBA: andyd
20011209163310:Sent page to 7775551214@mobile.att.net
20011209163510:Stopping Service

Let's examine a few of the high points found in the logfile.

  1. At 04:31:06 PM, on December 9, 2001, dbup found that ts20 was unavailable.

  2. Because the dbup_NT.conf file determined that this database was on a 24x7 schedule, the on-call DBA was paged at 04:31:07 PM.

  3. Two minutes later, the database was still unavailable, so the on-call DBA was again paged at 04:33:10 PM.

  4. At this point, the Oracle_dbup_Monitor service was stopped.

We then changed the following dbup_NT.conf file, indicating that we only required the ts20 database to be up from Monday through Friday. The Oracle_dbup_Monitor service was restarted, allowed to run for five minutes, and then stopped:

ts20 => { machine => 'mycroft',
          upDays => [1..5],
          upHours => [6..18],
          username => 'system',
          alertLevel => 3 },

The results of this change are seen in Example 11-12.

Example 11-12. dbup test #2
20011209164456:Service Starting - State is: 
220011209164456:Service Started - State is: 
420011209164456:Service running20011209164456:Main Loop20011209164457:Check database: 
ts0120011209164458:Connection to ts01 successful
20011209164458:Check database: ts99
20011209164458:Connection to ts99 successful
20011209164458:Check database: ts20
20011209164459:Database ts20 down during off hours
20011209164459:On call DBA is: andyd
20011209164500:Sent email to andyd@somewhere.com, jkstill@somewhere.com
20011209164700:Service running
20011209164700:Main Loop
20011209164701:Check database: ts01
20011209164701:Connection to ts01 successful
20011209164701:Check database: ts99
20011209164701:Connection to ts99 successful
20011209164701:Check database: ts20
20011209164702:Database ts20 down during off hours
20011209164702:On call DBA is: andyd
20011209164703:Sent email to andyd@somewhere.com, jkstill@somewhere.com
20011209164903:Stopping Service

Please note the following highlights from Example 11-12:

  1. At 04:44:59 PM, and again at 04:47:02 PM, ts20 was unavailable.

  2. Because this occurred outside of the required uptime for this database, the on-call DBA was emailed, but not paged.

  3. If ts20 remains unavailable after 06:00 AM on one day of required uptime, the on-call DBA is paged.

We've found this utility to be very useful, and we hope you do too. The ability to catch database connectivity problems before they have an effect on the users of the live database goes a long way toward maintaining good customer relations.

11.2.5.2 Command-line options

Only a few command-line options are available for dbup.pl and dbup_NT.pl; they are summarized in Table 11-5.

If you change an option for Win32, you must remove the Oracle_dbup_Monitor service via dbup_service.pl, edit dbup_service.pl, and then re-install the service at the command line to put the changes into effect. The reason for this is that the monitor is started and stopped via the Win32 Service Manager, and any command-line arguments to the dbup monitor are stored in the Win32 Registry. You may take comfort in the fact that the defaults are probably fine for most systems.

Table 11-5. Command-line options — dbup.pl and dbup_NT.pl

Option

Description

-conf

Name of the configuration file. This defaults to dbup.conf. You do not need to specify a full path locator.

-daemon

On Unix, causes dbup.pl to run in the background as a daemon. This option is not available for dbup_NT.pl on Win32.

-debug

On Unix, causes additional information to be placed in the logfile. If you are running in console mode, informational debug messages will be printed to the terminal. On Win32 systems, this option causes additional information to be placed in the logfile only.

-mail

Causes mail to be sent to the DBAs.

-nomail

Prevents email from being sent to the DBAs. (The default between -mail and -nomail is -mail.)

-help

Prints a help message on the terminal console.

    Team LiB   Previous Section   Next Section