Monitoring the ARC Information System

The main configuration section for these probes is arcinfosys; see Configuration Files.

EGIIS Check

To monitor an EGIIS service, use

check_egiis -H <HOST> [-P <PORT>] --index=<INDEX-NAME>

This will do an LDAP query of the EGIIS service on <HOST>:<PORT>. The default port is 2135. The base DN of the query is Mds-Vo-name=<INDEX-NAME>, o=grid. The probe will also fetch the subschema at cn=subschema and check the presence of attributes against MAY and MUST specifications in the schema. In addition some type conversions are attempted to catch invalid data.

Any validation error will give a CRITICAL Nagios status. If the index is empty, a WARNING Nagios status is reported. Otherwise, the status is OK and the counts for the different registration states are printed.
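The status rules above can be summarised in a few lines of Python. This is only a sketch of the decision logic, not the probe's actual code: the function egiis_status and its arguments are hypothetical, and the registration state names in the usage example are placeholders. The numeric values are the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def egiis_status(validation_errors, registration_states):
    """Hypothetical helper mapping check results to (status, message).

    validation_errors: list of error strings from the schema validation.
    registration_states: one state string per entry found in the index.
    """
    if validation_errors:
        return CRITICAL, "%d validation error(s)" % len(validation_errors)
    if not registration_states:
        return WARNING, "EGIIS index is empty"
    # Count the entries per registration state for the status line.
    counts = {}
    for state in registration_states:
        counts[state] = counts.get(state, 0) + 1
    summary = ", ".join("%d %s" % (n, s) for s, n in sorted(counts.items()))
    return OK, summary


# Placeholder state names, for illustration only.
print(egiis_status([], ["VALID", "VALID", "PURGED"]))
```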

CE Health State using EMIES

The following probe contacts the EMIES service of the compute element and checks the HealthState element in the reply.

check_arcservice -u <url> [-k <key-file> -c <cert-file>] [-t <timeout>]

arcinfo -c <host> shows whether a CE supports EMIES and which URL to use. EMIES uses SSL client authentication. By default the host certificate is used. To use a grid proxy, pass it as both key and certificate. Example:

check_arcservice -u https://arcce.example.org:443/arex \
        -k /tmp/x509up_1000 -c /tmp/x509up_1000

CE Infosys Validation for the GLUE 2 LDAP Schema

You can test the GLUE 2 LDAP records published by a CE with

check_arcglue2 -H <HOST> [-P <PORT>] \
        [--glue2-schema PATH] [--if-dependent-schema STATUS] \
        [--warn-if-missing OBJECTCLASS,...,OBJECTCLASS] \
        [--critical-if-missing OBJECTCLASS,...,OBJECTCLASS] \
        [--hierarchical-foreign-keys FOREIGN-KEY,...,FOREIGN-KEY] \
        [--hierarchical-aggregates]

See check_arcglue2 --help for a full list of options.

This probe will do a full query under o=glue on the provided host and port (the default port is 2135) and perform the following checks.

As a basic check that the information system contains data, --warn-if-missing and --critical-if-missing may be passed a comma-separated list of LDAP objectclasses for which there should be at least one entry in the information system. By default, a warning is raised if the system has no entries of type GLUE2AdminDomain, GLUE2Service, or GLUE2Endpoint.

The probe will verify each entry using the GLUE 2 LDAP schema. By default, the GLUE 2 schema is expected at /etc/ldap/schema/GLUE20.schema. An alternative path may be specified with the --glue2-schema option. If the schema is not found, a warning is raised and the schema is fetched from cn=subschema. The rationale behind this warning is that the content should be checked independent of what the remote end claims it should be. Another Nagios status can be specified with --if-dependent-schema, including OK to disable the warning.

As GLUE 2 is relational in nature, the probe does further checks on connections which cannot be specified in the LDAP schema. It checks uniqueness of the *ID attributes, and the outgoing and incoming multiplicities of *ForeignKey attributes as specified in the GLUE Specification v2.0 [GLUE2] and the LDAP schema reference implementation [GLUE2L].
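As a rough illustration of this kind of relational check, the following sketch flags duplicate *ID values and *ForeignKey values which do not refer to any known ID. It makes several simplifying assumptions which are not taken from the probe: entries are plain dictionaries mapping attribute names to lists of values, IDs are compared across all object types, and the full multiplicity rules of [GLUE2] are not modelled. The function name is hypothetical.

```python
from collections import Counter


def check_ids_and_foreign_keys(entries):
    """Return a list of problem descriptions (hypothetical helper).

    entries: list of dicts mapping LDAP attribute names to lists of values.
    """
    problems = []

    # Collect every value of every attribute whose name ends in "ID".
    ids = Counter()
    for entry in entries:
        for attr, values in entry.items():
            if attr.endswith("ID"):
                ids.update(values)
    for value, count in sorted(ids.items()):
        if count > 1:
            problems.append("duplicate ID %s (%d entries)" % (value, count))

    # Every foreign key value should refer to an ID seen above.
    for entry in entries:
        for attr, values in entry.items():
            if attr.endswith("ForeignKey"):
                for value in values:
                    if value not in ids:
                        problems.append(
                            "%s refers to unknown ID %s" % (attr, value))
    return problems


entries = [
    {"GLUE2ServiceID": ["urn:svc:1"]},
    {"GLUE2EndpointID": ["urn:ep:1"],
     "GLUE2EndpointServiceForeignKey": ["urn:svc:1"]},
]
print(check_ids_and_foreign_keys(entries))
```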

Further, the probe checks some of the constraints on the directory information tree (DIT) [GLUE2L]. A critical condition is raised if the following conditions are not met.

  • All GLUE2Extension objects must appear immediately below the object they extend.
  • Objects which are aggregates of a GLUE2Service must appear somewhere below that service.
  • Services which link to a GLUE2AdminDomain cannot reside under a different domain.

Optionally you can require the DIT to reflect additional foreign keys, either passing an explicit list to --hierarchical-foreign-keys, or passing --hierarchical-aggregates to include all keys which represent aggregation or composition. Note that the latter will fail unless services are structured under their administrative domain, if any.

CE Infosys Validation for the NorduGrid and GLUE 1 Schemas

The ARIS probe is invoked with

check_aris -H <HOST> [-P <PORT>] [--cluster <CLUSTER>...] \
        [--cluster-test <testname>...] [--queue-test <testname>...] \
        [OTHER-OPTIONS...]

See check_aris --help for the full list of options. The probe queries Mds-Vo-name=local, o=grid on <HOST>:<PORT>. The default port is 2135. If one or more clusters are specified with the --cluster option, only those will be checked (nordugrid-cluster-name=<CLUSTER>), and it is considered an error for any of them to be missing. The probe validates the attributes of entries against the MAY and MUST specifications of the schema, and attempts some type conversions. For each cluster found, the probe will query and validate its queues.

If no clusters are found, or if no queues are found for a given cluster, this is reported as a warning. You can change this by passing a Nagios status to the option --if-no-clusters or --if-no-queues, respectively. Valid statuses are ok, warning, critical, and unknown, though only the first three make sense here.

This probe can also do custom checks on the LDAP data, either numeric limits or regular-expression matches. A custom test defined in the configuration file under a section arcinfosys.aris.<testname> can be enabled by passing any number of --cluster-test <testname> and --queue-test <testname> options to the probe. The tests are run on entries of the type nordugrid-cluster and nordugrid-queue, respectively.

The ARIS infosystem contains an attribute nordugrid-cluster-contactstring which provides the interface for job submission. You can check that this URL is accessible by passing --check-contact. This will do a list operation and, if the logging level is INFO or lower, will report the number of entries. If the attribute is missing or the URL is inaccessible, the service goes CRITICAL with an appropriate message.

Limit Checks

A limit check takes the form

[arcinfosys.aris.<testname>]
type = limit
value = <expr>
critical.min = <value>
critical.max = <value>
critical.message = <message>
warning.min = <value>
warning.max = <value>
warning.message = <message>

The type and value variables are required, and at least one of the min or max variables should be given for the test to be useful. There are reasonable defaults for the messages, though if your <expr> is complex, you may want to provide a more human-readable version. The probe will

  • Evaluate <expr> using Python’s eval function, in an environment which maps variable names derived from the LDAP attribute names to the corresponding converted values. The variable names are obtained from the attribute names by replacing “-” with “_” and stripping common prefixes including “nordugrid-cluster-”, “nordugrid-queue-”, and “Mds-”.
  • If critical.min is given and the result is below this value, or if critical.max is given and the result is above this value, report it as a critical error.
  • Similarly for warning.min and warning.max, report it as a warning.
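The steps above can be sketched in Python as follows. This is an approximation for illustration, not the probe's implementation: the function names and the prefix list are assumptions based on the description, and the real probe may strip further prefixes or convert values differently.

```python
OK, WARNING, CRITICAL = 0, 1, 2

# Prefixes named in the text; the real list may be longer (assumption).
PREFIXES = ("nordugrid-cluster-", "nordugrid-queue-", "Mds-")


def variable_name(attribute):
    """Derive a Python variable name from an LDAP attribute name."""
    for prefix in PREFIXES:
        if attribute.startswith(prefix):
            attribute = attribute[len(prefix):]
            break
    return attribute.replace("-", "_")


def run_limit_check(expr, entry, critical=(None, None), warning=(None, None)):
    """Evaluate expr against one entry and compare with (min, max) limits.

    entry: dict mapping LDAP attribute names to already converted values.
    """
    env = {variable_name(name): value for name, value in entry.items()}
    result = eval(expr, {}, env)

    def outside(limits):
        low, high = limits
        return ((low is not None and result < low) or
                (high is not None and result > high))

    if outside(critical):
        return CRITICAL, result
    if outside(warning):
        return WARNING, result
    return OK, result


# Two free CPUs: above critical.min = 1 but below warning.min = 4.
entry = {"nordugrid-cluster-totalcpus": 64, "nordugrid-cluster-usedcpus": 62}
print(run_limit_check("totalcpus - usedcpus", entry,
                      critical=(1, None), warning=(4, None)))  # (1, 2)
```

Since the expression is passed to eval, it should only come from a configuration file which is writable by trusted administrators.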

Regular Expression Checks

A regular expression check takes the form:

[arcinfosys.aris.<testname>]
type = regex
variable = <varname>
critical.pattern = <python-regex>
critical.message = <message>
warning.pattern = <python-regex>
warning.message = <message>

The type and variable settings are required, and you should specify at least one of critical.pattern and warning.pattern. The variable name is obtained in the same way as for the limit checks. The probe will consider all values for the LDAP attribute corresponding to <varname>.

  • If critical.pattern is specified and none of the values match it, then a critical condition is reported, else
  • if warning.pattern is specified and none of the values match it, then a warning is reported.

The following example will issue a critical state if a queue is not active:

[arcinfosys.aris.queue-active]
type = regex
variable = status
critical.pattern = ^active$
critical.message = Inactive queue
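The regular expression check described above can be sketched as follows. The function name is hypothetical, and the use of re.search rather than re.match is an assumption; for an anchored pattern such as ^active$ the two behave the same.

```python
import re

OK, WARNING, CRITICAL = 0, 1, 2


def run_regex_check(values, critical_pattern=None, warning_pattern=None):
    """Return a Nagios status for the values of one LDAP attribute."""

    def none_match(pattern):
        return not any(re.search(pattern, value) for value in values)

    if critical_pattern is not None and none_match(critical_pattern):
        return CRITICAL
    if warning_pattern is not None and none_match(warning_pattern):
        return WARNING
    return OK


# The queue-active test from the example, applied to two status values.
print(run_regex_check(["active"], critical_pattern=r"^active$"))    # 0
print(run_regex_check(["inactive"], critical_pattern=r"^active$"))  # 2
```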

Glue Schema Checks

Some CEs publish cluster and queue information in the Glue schema in addition to the NorduGrid schema. You can enable schema checks for these if present by passing --enable-glue.

The information in the Glue entries should match information in the ARC entries as described in [ARCIS2011]. You can enable a partial comparison of GlueCE, GlueCluster, and GlueSubCluster records by passing --compare-glue.

Checking Expiration of Host Certificates

A separate probe is provided for checking the host certificate as reported by the information system:

check_archostcert -H <HOST> [-p <PORT>] \
                  [-c <CRITDAYS>] [-w <WARNDAYS>] [-t <TIMEOUT>]

The suggestion is to run this for each compute element at a low frequency, such as once or a few times a day. A command definition like

define command {
    command_name check_archostcert
    command_line $USER1$/check_archostcert -H $HOSTNAME$ -c 7 -w 31
}

will warn about a certificate one month before it expires and report a critical status one week before. The port number defaults to 2135, but can be changed with -p <port>, and a timeout of <T> seconds is specified as -t <T>. See also check_archostcert --help.
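The effect of the two thresholds can be illustrated with a short Python sketch. The function name is hypothetical, and whether the probe treats the boundary day itself as inside or outside the threshold is an assumption here.

```python
from datetime import datetime, timedelta

OK, WARNING, CRITICAL = 0, 1, 2


def cert_status(not_after, now, crit_days=7, warn_days=31):
    """Map the remaining certificate lifetime to a Nagios status."""
    remaining = not_after - now
    if remaining < timedelta(days=crit_days):
        # Covers certificates which have already expired as well.
        return CRITICAL
    if remaining < timedelta(days=warn_days):
        return WARNING
    return OK


now = datetime(2024, 1, 1)
print(cert_status(datetime(2024, 1, 5), now))   # 4 days left
print(cert_status(datetime(2024, 1, 20), now))  # 19 days left
print(cert_status(datetime(2024, 6, 1), now))   # about 5 months left
```

With -c 7 -w 31 as in the command definition, the three calls correspond to a critical, a warning, and an OK status, respectively.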

The lifetime of the host certificate can also be checked using a generic HTTPS probe against the EMIES service, as long as the probe supports client authentication and lifetime checks.

[GLUE2] “GLUE Specification v2.0”; Sergio Andreozzi (ed.), et al.; http://www.ogf.org/documents/GFD.147.pdf
[GLUE2L] “GLUE v. 2.0 – Reference Implementation of an LDAP Schema”; Sergio Andreozzi (ed.), et al.; https://forge.ogf.org/sf/docman/do/downloadDocument/projects.glue-wg/docman.root.drafts/doc15526
[ARCIS2011] “The NorduGrid-ARC Information System”; Balázs Kónya and Daniel Johansson; NORDUGRID-TECH-4; http://www.nordugrid.org/documents/arc_infosys.pdf