Monitoring ARC Compute Elements by Job Submission

General Configuration
For each CE to monitor, run
check_arcce_submit -H <HOST>
This should be run at a relatively low frequency in order to let one job finish before the next is submitted. The probe keeps track of submitted jobs, and will hold the next submission if necessary. Subsequent sections describe additional options for testing data-staging, running custom scripts, etc.
On a more regular basis, every 5 minutes or so, run
check_arcce_monitor
which monitors the status of all jobs on each host and submits the results passively to a service matching the host name with the service description “ARCCE Job Termination”. The passive service name can be configured.
Finally, a probe is provided to tidy the ARC job list after unsuccessful attempts by check_arcce_monitor to clean jobs. This is also set up as a single service, and only needs to run occasionally, like once a day:
check_arcce_clean
For additional options, see
check_arcce_submit --help
check_arcce_monitor --help
check_arcce_clean --help
Plugin Configuration
The main configuration section for this probe is arcce, see Configuration Files. This probe requires an X509 proxy, see Proxy Certificate.

Connection URLs for job submission (the --ce option) may be specified in the section arcce.connection_urls.
Example:
[arcce]
voms = ops
user_cert = /etc/nagios/globus/robot-cert.pem
user_key = /etc/nagios/globus/robot-key.pem
loglevel = DEBUG
[arcce.connection_urls]
arc1.example.org = ARC1:https://arc1.example.org:443/ce-service
arc0.example.org = ARC0:arc0.example.org:2135/nordugrid-cluster-name=arc0.example.org,Mds-Vo-name=local,o=grid
The user_key and user_cert options may be better placed in the common gridproxy section.
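As a sketch of that layout, the credentials from the example above could be moved into a shared gridproxy section:

```ini
# Shared credentials, usable by several probes.
[gridproxy]
user_cert = /etc/nagios/globus/robot-cert.pem
user_key = /etc/nagios/globus/robot-key.pem

# Probe-specific settings remain in the arcce section.
[arcce]
voms = ops
loglevel = DEBUG
```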
Nagios Configuration
You will need command definitions for monitoring, cleaning, and submission:
define command {
command_name check_arcce_monitor
command_line $USER1$/check_arcce_monitor -H $HOSTNAME$
}
define command {
command_name check_arcce_clean
command_line $USER1$/check_arcce_clean -H $HOSTNAME$
}
define command {
command_name check_arcce_submit
command_line $USER1$/check_arcce_submit -H $HOSTNAME$ \
[--test <test_name> ...]
}
For monitoring and cleaning, add single services like
define service {
use monitoring-service
host_name localhost
service_description ARCCE Monitoring
check_command check_arcce_monitor
}
define service {
use monitoring-service
host_name localhost
service_description ARCCE Cleaner
check_command check_arcce_clean
normal_check_interval 1440
retry_check_interval 120
}
For each host, add something like
define service {
use submission-service
host_name arc0.example.org
service_description ARCCE Job Submission
check_command check_arcce_submit
}
define service {
use passive-service
host_name arc0.example.org
service_description ARCCE Job Termination
check_command check_passive
}
The --test <test_name> option enables tests to run in addition to a plain job submission. Tests are specified in individual sections of the configuration files as described below. Such a test may optionally submit its results to a named passive service instead of the above termination service. To do so, add the Nagios configuration for the service and duplicate the “service_description” in the section defining the test.
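As a sketch, a submission command enabling a test named python, together with the passive service receiving its result, could look as follows; the host name is an example, and the service_description must match the one in the [arcce.python] section shown under Scripted Checks:

```cfg
define command {
    command_name check_arcce_submit_python
    command_line $USER1$/check_arcce_submit -H $HOSTNAME$ --test python
}

define service {
    use passive-service
    host_name arc0.example.org
    service_description ARCCE Python version
    check_command check_passive
}
```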
See the arcce-example.cfg for a more complete Nagios configuration.
Running Multiple Job Services on the Same Host
By default, running jobs are tracked on a per-host basis. To define multiple job submission services for the same host, pass to --job-tag a tag which identifies the service uniquely on that host. Remember to also add a passive service and pass the corresponding --termination-service option.
The scheme for configuring an auxiliary submission/termination service is:
define command {
command_name check_arcce_submit_<test_name>
command_line $USER1$/check_arcce_submit -H $HOSTNAME$ \
--job-tag <test_name> \
--termination-service 'ARCCE Job Termination for <Test-Description>' \
[--test <test1> ...]
}
define service {
use submission-service
host_name arc0.example.org
service_description ARCCE Job Submission for <Test-Description>
check_command check_arcce_submit_<test_name>
}
define service {
use passive-service
host_name arc0.example.org
service_description ARCCE Job Termination for <Test-Description>
check_command check_passive
}
Custom Job Descriptions

If the generated job scripts and job descriptions are not sufficient, you can provide hand-written ones by passing the --job-description option to the check_arcce_submit command. This option is incompatible with --test. Currently no substitutions are done in the job description file, other than what may be provided by ARC.
Job Tests

Scripted Checks
It is possible to add custom commands to the job scripts and do a regular expression match on the output. E.g. to test that Python is installed and report the version, add the following section to the plugin configuration file:
[arcce.python]
jobplugin = scripted
required_programs = python
script_line = python -V >python.out 2>&1
output_file = python.out
output_pattern = Python\s+(?P<version>\S+)
status_ok = Found Python version %(version)s.
status_critical = Python version not found in output.
service_description = ARCCE Python version
The options are:

- required_programs
  Space-separated list of programs to check for before running the script. If one of the programs is not found, it is reported as a critical error.
- script_line
  One-liner shell code to run, using only features commonly supported by /bin/sh on your CEs.
- output_file
  The name of the file your script produces. This is mandatory, and the same file is used to communicate errors back to check_arcce_monitor. The reason standard output is not used is to allow multiple job tests to publish independent passive results.
- output_pattern
  A Python regular expression which is searched for in the output of the script. The search stops at the first matching line. You cannot match more than one line, so distill the output in script_line if necessary. Named regular expression groups of the form (?P<v>...) capture their text in a variable v, which can be substituted into the status messages.
- status_ok
  The status message if the above regular expression matches. A named regular expression group captured in a variable v can be substituted with %(v)s.
- status_critical
  The status message if the regular expression does not match. Obviously you cannot substitute RE groups here. If the test for required programs fails, the status message will instead indicate which programs are missing.
- service_description
  The service_description of the passive Nagios service to which results are reported.
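The matching and substitution logic can be sketched in a few lines of Python, using the [arcce.python] values from above; the evaluate function is purely illustrative, not part of the probe's API:

```python
import re

# Values as they would appear in the [arcce.python] section above.
output_pattern = r"Python\s+(?P<version>\S+)"
status_ok = "Found Python version %(version)s."
status_critical = "Python version not found in output."

def evaluate(output_file_text):
    """Search each line of the output file; stop at the first match."""
    for line in output_file_text.splitlines():
        m = re.search(output_pattern, line)
        if m:
            # Named groups become variables substitutable in status_ok.
            return 0, status_ok % m.groupdict()
    return 2, status_critical

code, message = evaluate("Python 2.7.5\n")
print(code, message)  # → 0 Found Python version 2.7.5.
```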
See Probe Configuration for more examples.
It is possible to give the remote script more control over the probe status. Instead of using output_pattern, the script may pass status messages and an exit code back to Nagios by printing certain magic strings to the file specified by output_file:

__status <status-code> <status-message>
sets the exit code and status line of the probe.
__log <level> <message>
emits an additional status line, which will be shown if the log level set in the probe configuration is at least <level>, a numeric value from the Python logging module.
__exit <exit-code>
is used to report the exit code of a script. Anything other than 0 will cause a CRITICAL status. You probably don’t want to use this yourself.

The __status line may occur before, between, or after __log lines. This can be convenient for logging detailed check results and issues before the final status is known.
It is possible to adapt a Nagios-style probe check_foo to this scheme by wrapping it in some shell code:
script_line = (/bin/sh check_foo 2>&1; echo __status $?) | \
(read msg; sed -e 's/^/__log 20 /' -e '$s;^__log 20 \(.*\);\1 '"$msg;") \
> check_foo.out
output_file = check_foo.out
staged_inputs = file:////path-to/check_foo
Staging Checks
The “staging” job plug-in checks that file staging works in connection with
job submission. It is enabled with --test <test-name>
where the
plugin configuration file contains a corresponding section:
[arcce.<test-name>]
jobplugin = staging
staged_inputs = <URL> ... <URL>
staged_outputs = <URL> ... <URL>
service_description = <TARGET-FOR-PASSIVE-RESULT>
Note that the URLs are space-separated. They can be placed on separate indented lines. Within the URLs, the following substitutions may be useful:

%(hostname)s
The argument to the -H option if passed to the probe, otherwise “localhost”.
%(epoch_time)s
The integer number of seconds since the Epoch.
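As a sketch, a staging test section using both substitutions might look like the following; the SE host name, path, and service description are hypothetical:

```ini
[arcce.srm-staging]
jobplugin = staging
staged_outputs = srm://se.example.org/testfiles/%(hostname)s-%(epoch_time)s.txt
service_description = ARCCE SRM Staging
```

Using both substitutions in the file name gives each submitting host a fresh, non-colliding test file per job.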
If a staging check fails, the whole job will fail, so its status cannot be submitted to an individual passive service as with scripted checks. For this reason, it may be preferable to create one or more individual submission services dedicated to file staging. Remember to pass unique names to --job-tag to isolate them.
Custom Substitutions in Job Test Sections
In job test sections you can use substitutions of the form %(<var-name>)s, where <var-name> is defined in a separate section as described below. Variable definitions can themselves contain substitutions of this kind. Cyclic definitions are detected and reported as UNKNOWN.
Probe Option. A section of the form
[variable.<var>]
method = option
default = <default-value>
declares <var> as an option which can be passed to the probe with -O <var>=<value>. The default field may be omitted, in which case the probe option becomes mandatory for any tests using the variable.
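For example, a hypothetical se_dir variable with an overridable default:

```ini
[variable.se_dir]
method = option
default = /nagios-testfiles
```

A test using %(se_dir)s then expands to /nagios-testfiles unless the probe is invoked with -O se_dir=<other-path>.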
UNIX Environment.
A section of the following form declares that <var> shall be imported from the UNIX environment. If no default value is provided, then the environment variable must be exported to the probe.
[variable.<var>]
method = getenv
envvar = <VARIABLE>

The envvar line optionally specifies the name of the environment variable to look up, which otherwise defaults to <var>.
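For example, importing the proxy location from the environment; the variable name proxy_path is chosen for illustration, while X509_USER_PROXY is the conventional Grid environment variable:

```ini
[variable.proxy_path]
method = getenv
envvar = X509_USER_PROXY
```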
Pipe Output. The following allows you to capture the output of a shell command:
[variable.<var>]
method = pipe
command = <command-line>
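For example, a hypothetical variable holding the monitoring host’s fully qualified name:

```ini
[variable.local_fqdn]
method = pipe
command = hostname -f
```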
Custom Time Stamp.
This method provides a custom time stamp format as an alternative to
%(epoch_time)s
. It takes the form
[variable.<var>]
method = strftime
format = <escaped-strftime-style-format>
Note that the % characters in the format field must be escaped as %%, so as to avoid attempts to parse them as interpolations. Alternatively, a raw_format field can be used, which is interpreted literally.
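For example, a date-only time stamp (note the doubled % signs in the format field):

```ini
[variable.date_tag]
method = strftime
format = %%Y-%%m-%%d
```

The equivalent unescaped form would be raw_format = %Y-%m-%d.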
Random Line from File.
A section of the following form picks a random line from <path>
. A low
entropy system source is used for seeding.
[variable.<var>]
method = random_line
input_file = <path>
exclude = <optional-space-separated-list>
Leading and trailing spaces are trimmed, and empty lines and lines starting with a # character are ignored. If provided, any lines matching one of the space-separated words in exclude are ignored as well.
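As a sketch, using the host list from the example at the end of this section; the excluded host name is hypothetical:

```ini
[variable.se_host]
method = random_line
input_file = /var/lib/gridprobes/ops/goodses.conf
exclude = se-retired.example.org
```

Here goodses.conf would contain one storage element host name per line, with # comment lines allowed.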
Switch. If you need to set a variable on a case to case basis, the form is
[variable.<var>]
method = switch
index = <index-value>
case[<index-1>] = <value-1>
# ...
case[<index-n>] = <value-n>
default = <default-value>
This will first expand “<index-value>
”. If this matches “<index-<i>>
”
for some “<i>
”, then the expansion of <value-<i>>
is returned,
otherwise <default-value>
. See also the example below.
LDAP Search. A value can be extracted from an LDAP attribute using
[variable.<var>]
method = ldap
uri = <ldap-uri>( <ldap-uri>)*
filter = <ldap-filter>
attribute = <ldap-attribute>
default = <optional-default-value>
If multiple records are returned, the first returned record which provides a value for the requested attribute is used. If the attribute has multiple values, the first returned value is used. Note that the LDAP server may not guarantee stable ordering.
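As a sketch only: the URI, filter, and attribute below follow the GLUE 1.3 information schema and are purely illustrative:

```ini
[variable.se_path]
method = ldap
uri = ldap://se-1.example.org:2170
filter = (objectClass=GlueSA)
attribute = GlueSAPath
default = /nagios-testfiles
```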
Example.
In the following staging test, %(se_host)s is replaced by a random host name from the file /var/lib/gridprobes/ops/goodses.conf, and %(now)s by a customized time stamp.
[arcce.srm]
jobplugin = staging
staged_outputs = srm://%(se_host)s/%(se_dir)s/%(hostname)s-%(now)s.txt
service_description = Test Service
[variable.se_host]
method = random_line
input_file = /var/lib/gridprobes/ops/goodses.conf
[variable.now]
method = strftime
raw_format = %FT%T
[variable.se_dir]
method = switch
index = %(se_host)s
case[se-1.example.org] = /pnfs/se-1.example.com/nagios-testfiles
case[se-2.example.org] = /dpm/se-2.example.com/home/nagios-testfiles
default = /nagios-testfiles