Monitoring ARC Compute Elements

General Configuration

For each CE to monitor, run

check_arcce_submit -H <HOST>

This should be run at a relatively low frequency in order to let one job finish before the next is submitted. The probe keeps track of submitted jobs, and will hold the next submission if necessary. Subsequent sections describe additional options for testing data-staging, running custom scripts, etc.

On a more regular basis, every 5 minutes or so, run

check_arcce_monitor

which will monitor the status of all jobs on each host and passively submit the result to a service matching the host name and the service description “ARCCE Job Termination”. The passive service name can be configured.

Finally, a probe is provided to tidy the ARC job list after unsuccessful attempts by check_arcce_monitor to clean jobs. This is also set up as a single service, and only needs to run occasionally, like once a day:

check_arcce_clean

For additional options, see

check_arcce_submit --help
check_arcce_monitor --help
check_arcce_clean --help

Plugin Configuration

The main configuration section for this probe is arcce; see Configuration Files. The probe requires an X.509 proxy; see Proxy Certificate.

Connection URLs for job submission (the --ce option) may be specified in the section arcce.connection_urls.

Example:

[arcce]
voms = ops
user_cert = /etc/nagios/globus/robot-cert.pem
user_key = /etc/nagios/globus/robot-key.pem
loglevel = DEBUG

[arcce.connection_urls]
arc1.example.org = ARC1:https://arc1.example.org:443/ce-service
arc0.example.org = ARC0:arc0.example.org:2135/nordugrid-cluster-name=arc0.example.org,Mds-Vo-name=local,o=grid

The user_key and user_cert options may be better placed in the common gridproxy section.
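
For instance, assuming the same robot credentials as in the example above, the shared section could read:

[gridproxy]
user_cert = /etc/nagios/globus/robot-cert.pem
user_key = /etc/nagios/globus/robot-key.pem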

Nagios Configuration

You will need command definitions for monitoring, cleaning, and submission:

define command {
    command_name check_arcce_monitor
    command_line $USER1$/check_arcce_monitor -H $HOSTNAME$
}
define command {
    command_name check_arcce_clean
    command_line $USER1$/check_arcce_clean -H $HOSTNAME$
}
define command {
    command_name check_arcce_submit
    command_line $USER1$/check_arcce_submit -H $HOSTNAME$ \
                    [--test <test_name> ...]
}

For monitoring and cleaning, add a single service for each, like

define service {
    use                     monitoring-service
    host_name               localhost
    service_description     ARCCE Monitoring
    check_command           check_arcce_monitor
}
define service {
    use                     monitoring-service
    host_name               localhost
    service_description     ARCCE Cleaner
    check_command           check_arcce_clean
    normal_check_interval   1440
    retry_check_interval    120
}

For each host, add something like

define service {
    use                     submission-service
    host_name               arc0.example.org
    service_description     ARCCE Job Submission
    check_command           check_arcce_submit
}
define service {
    use                     passive-service
    host_name               arc0.example.org
    service_description     ARCCE Job Termination
    check_command           check_passive
}

The --test <test_name> option enables tests to run in addition to a plain job submission. The tests are specified in individual sections of the configuration files, as described below. Such a test may optionally submit its result to a named passive service instead of the above termination service. To do so, add the Nagios configuration for the service and duplicate the “service_description” in the section defining the test.
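
For example, a submission command running the scripted Python test defined under Job Tests below, together with a passive service matching its service_description, could look like this (the host name is illustrative):

define command {
    command_name check_arcce_submit_python
    command_line $USER1$/check_arcce_submit -H $HOSTNAME$ --test python
}
define service {
    use                     passive-service
    host_name               arc0.example.org
    service_description     ARCCE Python version
    check_command           check_passive
}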

See the arcce-example.cfg for a more complete Nagios configuration.

Running Multiple Job Services on the Same Host

By default, running jobs are tracked on a per-host basis. To define multiple job submission services for the same host, pass to --job-tag a tag which identifies the service uniquely on this host. Remember to also add a passive service and pass the corresponding --termination-service option.

The scheme for configuring an auxiliary submission/termination service is:

define command {
    command_name check_arcce_submit_<test_name>
    command_line $USER1$/check_arcce_submit -H $HOSTNAME$ \
        --job-tag <test_name> \
        --termination-service 'ARCCE Job Termination for <Test-Description>' \
        [--test <test1> ...]
}
define service {
    use                     submission-service
    host_name               arc0.example.org
    service_description     ARCCE Job Submission for <Test-Description>
    check_command           check_arcce_submit_<test_name>
}
define service {
    use                     passive-service
    host_name               arc0.example.org
    service_description     ARCCE Job Termination for <Test-Description>
    check_command           check_passive
}

Custom Job Descriptions

If the generated job scripts and job descriptions are not sufficient, you can provide hand-written ones by passing the --job-description option to the check_arcce_submit command. This option is incompatible with --test.

Currently no substitutions are done in the job description file, other than what may be provided by ARC.
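
For example, to submit a hand-written xRSL description (the file path is illustrative):

check_arcce_submit -H arc0.example.org \
    --job-description /etc/nagios/arcce/custom-job.xrsl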

Job Tests

Scripted Checks

It is possible to add custom commands to the job scripts and do a regular expression match on the output. For example, to test that Python is installed and report its version, add the following section to the plugin configuration file:

[arcce.python]
jobplugin = scripted
required_programs = python
script_line = python -V >python.out 2>&1
output_file = python.out
output_pattern = Python\s+(?P<version>\S+)
status_ok = Found Python version %(version)s.
status_critical = Python version not found in output.
service_description = ARCCE Python version
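
With this section in place, the test is enabled by naming it after the “arcce.” prefix on the submission command line, for example:

check_arcce_submit -H arc0.example.org --test python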

The options are

required_programs
Space-separated list of programs to check for before running the script. If one of the programs is not found, it’s reported as a critical error.
script_line
A one-line shell command to run, restricted to features commonly supported by /bin/sh on your CEs.
output_file
The name of the file your script produces. This is mandatory, and the same file will be used to communicate errors back to check_arcce_monitor. The reason standard output is not used is to allow multiple job tests to publish independent passive results.
output_pattern
This is a Python regular expression which is searched for in the output of the script. The search stops at the first matched line. You cannot match more than one line, so distill the output in script_line if necessary (see the example after this list). A named regular expression group of the form (?P<v>...) captures its match in a variable v, which can be substituted in the status messages.
status_ok
The status message if the above regular expression matches. A named regular expression group captured in a variable v can be substituted with %(v)s.
status_critical
Status message if the regular expression does not match. Obviously you cannot do substitutions of RE groups here. If the test for required programs fails, then the status message will indicate which programs are missing instead.
service_description
The service_description of the passive Nagios service to which results are reported.
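
As an example of distilling output down to a single matchable line, the following sketch keeps only the first line of gcc --version (the test is illustrative):

[arcce.gcc]
jobplugin = scripted
required_programs = gcc
script_line = gcc --version 2>&1 | head -n 1 >gcc.out
output_file = gcc.out
output_pattern = gcc.*\s(?P<version>[0-9][0-9.]*)
status_ok = Found GCC version %(version)s.
status_critical = GCC version not found in output.
service_description = ARCCE GCC version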

See Probe Configuration for more examples.

It is possible to give the remote script more control over the probe status. Instead of using output_pattern, the script may pass status messages and an exit code back to Nagios. This is done by printing certain magic strings to the file specified by output_file:

  • __status <status-code> <status-message> sets the exit code and status line of the probe.
  • __log <level> <message> emits an additional status line which will be shown if the log level set in the probe configuration is at least <level>, a numeric value from the Python logging module.
  • __exit <exit-code> is used to report the exit code of a script. Anything other than 0 will cause a CRITICAL status. You probably don’t want to use this yourself.

The __status line may occur before, between, or after __log lines. This can be convenient for logging detailed check results and issues before the final status is known.
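
For example, a script which writes the following lines to its output file reports an OK status with two additional log lines at level 20 (INFO in the Python logging module); the messages are illustrative:

__log 20 checked /tmp
__log 20 checked /home
__status 0 All file systems healthy.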

It is possible to adapt a Nagios-style probe check_foo to this scheme by wrapping it in some shell code:

script_line = (/bin/sh check_foo 2>&1; echo __status $?) | \
    (read msg; sed -e 's/^/__log 20 /' -e '$s;^__log 20 \(.*\);\1 '"$msg;") \
    > check_foo.out
output_file = check_foo.out
staged_inputs = file:////path-to/check_foo

Staging Checks

The “staging” job plug-in checks that file staging works in connection with job submission. It is enabled with --test <test-name> where the plugin configuration file contains a corresponding section:

[arcce.<test-name>]
jobplugin = staging
staged_inputs = <URL> ... <URL>
staged_outputs = <URL> ... <URL>
service_description = <TARGET-FOR-PASSIVE-RESULT>

Note that the URLs are space-separated. They can be placed on separate indented lines. Within the URLs, the following substitutions may be useful:

%(hostname)s
The argument to the -H option if passed to the probe, else “localhost”.
%(epoch_time)s
The integer number of seconds since Epoch.

If a staging check fails, the whole job will fail, so its status cannot be submitted to an individual passive service as with scripted checks. For this reason, it may be preferable to create one or more individual submission services dedicated to file staging. Remember to pass unique names to --job-tag to isolate them.
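
A minimal staging test section, using both substitutions to generate a unique remote file name, might look like this (the SE URL and service name are illustrative):

[arcce.stageout]
jobplugin = staging
staged_outputs = gsiftp://se.example.org/nagios-testfiles/%(hostname)s-%(epoch_time)s.txt
service_description = ARCCE Staging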

Custom Substitutions in Job Test Sections

In job test sections you can use substitutions of the form %(<var-name>)s, where <var-name> is defined in a separate section as described below. Variable definitions can themselves contain substitutions of this kind. Cyclic definitions are detected and reported as UNKNOWN.

Probe Option. A section of the form

[variable.<var>]
method = option
default = <default-value>

declares <var> as an option which can be passed to the probe with -O <var>=<value>. The default field may be omitted, in which case the probe option becomes mandatory for any tests using the variable.
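
For instance, the following declares a variable %(vo_name)s which defaults to “ops” and can be overridden with -O vo_name=<value> (the variable name is illustrative):

[variable.vo_name]
method = option
default = ops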

UNIX Environment. A section of the following form declares that <var> shall be imported from the UNIX environment. If no default value is provided, then the environment variable must be exported to the probe.

[variable.<var>]
method = getenv
envvar = <VARIABLE>

The envvar line optionally specifies the name of the variable to look up, which otherwise defaults to <var>.
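
For instance, the following makes %(cert_dir)s expand to the value of the X509_CERT_DIR environment variable (the variable name is illustrative):

[variable.cert_dir]
method = getenv
envvar = X509_CERT_DIR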

Pipe Output. The following allows you to capture the output of a shell command:

[variable.<var>]
method = pipe
command = <command-line>
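
For instance, to capture the fully qualified host name of the monitoring host (illustrative):

[variable.local_fqdn]
method = pipe
command = hostname -f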

Custom Time Stamp. This method provides a custom time stamp format as an alternative to %(epoch_time)s. It takes the form

[variable.<var>]
method = strftime
format = <escaped-strftime-style-format>

Note that the % characters in the format field must be escaped as %%, so as to avoid attempts to parse them as interpolations. Alternatively, a raw_format field can be used, which is interpreted literally.
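
For instance, the following makes %(date)s expand to the current date; an equivalent definition could use raw_format = %Y-%m-%d instead (the variable name is illustrative):

[variable.date]
method = strftime
format = %%Y-%%m-%%d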

Random Line from File. A section of the following form picks a random line from <path>. A low entropy system source is used for seeding.

[variable.<var>]
method = random_line
input_file = <path>
exclude = <optional-space-separated-list>

Leading and trailing spaces are trimmed, and empty lines and lines starting with a # character are ignored. If provided, any lines matching one of the space-separated words in exclude are ignored as well.
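
An input file might look like this, where the comment and the blank line are skipped; adding exclude = se-3.example.org to the variable section would also skip the last entry (contents are illustrative):

# Known good storage elements
se-1.example.org
se-2.example.org

se-3.example.org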

Switch. If you need to set a variable on a case to case basis, the form is

[variable.<var>]
method = switch
index = <index-value>
case[<index-1>] = <value-1>
# ...
case[<index-n>] = <value-n>
default = <default-value>

This will first expand “<index-value>”. If the result matches “<index-i>” for some i, then the expansion of “<value-i>” is returned; otherwise “<default-value>” is returned. See also the example below.

LDAP Search. A value can be extracted from an LDAP attribute using

[variable.<var>]
method = ldap
uri = <ldap-uri>( <ldap-uri>)*
filter = <ldap-filter>
attribute = <ldap-attribute>
default = <optional-default-value>

If multiple records are returned, the first returned record which provides a value for the requested attribute is used. If the attribute has multiple values, the first returned value is used. Note that the LDAP server may not guarantee stable ordering.
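
An illustrative section which looks up a storage area path from a GLUE 1.3 information system; the URI, filter, and attribute are assumptions about the local setup:

[variable.se_path]
method = ldap
uri = ldap://se-1.example.org:2170
filter = (GlueChunkKey=GlueSEUniqueID=se-1.example.org)
attribute = GlueSAPath
default = /nagios-testfiles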

Example. In the following staging tests, %(se_host)s is replaced by a random host name from the file /var/lib/gridprobes/ops/goodses.conf, and %(now)s is replaced by a customized time stamp.

[arcce.srm]
jobplugin = staging
staged_outputs = srm://%(se_host)s/%(se_dir)s/%(hostname)s-%(now)s.txt
service_description = Test Service

[variable.se_host]
method = random_line
input_file = /var/lib/gridprobes/ops/goodses.conf

[variable.now]
method = strftime
raw_format = %FT%T

[variable.se_dir]
method = switch
index = %(se_host)s
case[se-1.example.org] = /pnfs/se-1.example.org/nagios-testfiles
case[se-2.example.org] = /dpm/se-2.example.org/home/nagios-testfiles
default = /nagios-testfiles