custodian.custodian module

This module implements the main Custodian class, which manages a list of jobs given a set of error handlers, as well as the abstract base classes for ErrorHandlers and Jobs.

class custodian.custodian.Custodian(handlers, jobs, validators=None, max_errors_per_job=None, max_errors=1, polling_time_step=10, monitor_freq=30, skip_over_errors=False, scratch_dir=None, gzipped_output=False, checkpoint=False, terminate_func=None, terminate_on_nonzero_returncode=True)

Bases: object

The Custodian class is the manager for a list of jobs given a list of error handlers. The way it works is as follows:

Let’s say you have defined a list of jobs as [job1, job2, job3, …] and you have defined a list of possible error handlers as [err1, err2, …]
Custodian will run the jobs in the order of job1, job2, … During each job, custodian will monitor for errors using the handlers that have is_monitor == True. If an error is detected, corrective measures are taken and the particular job is rerun.
At the end of each individual job, Custodian will run through the list of error handlers that have is_monitor == False. If an error is detected, corrective measures are taken and the particular job is rerun.
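The run-check-correct-rerun cycle described above can be sketched with a toy loop. This is a simplified illustration, not custodian's actual implementation: `ToyJob`, `ToyHandler`, and `run_all` are hypothetical names, the handler signatures are simplified, and monitoring during a job is left out.

```python
class ToyJob:
    """Hypothetical job that fails until a handler sets its 'fixed' flag."""

    def __init__(self, name):
        self.name = name
        self.fixed = False
        self.attempts = 0

    def run(self):
        self.attempts += 1
        return self.fixed  # True on success, False on failure


class ToyHandler:
    """Hypothetical end-of-job handler (is_monitor == False)."""

    is_monitor = False

    def check(self, job, succeeded):
        return not succeeded  # True means an error was detected

    def correct(self, job):
        job.fixed = True
        return {"errors": ["toy failure"], "actions": ["set fixed flag"]}


def run_all(jobs, handlers, max_errors=5):
    """Run jobs in order; rerun a job after each applied correction."""
    all_errors = []
    total = 0
    for job in jobs:
        job_errors = []
        while True:
            succeeded = job.run()
            # Handlers are tried in priority order; the first hit is used.
            handler = next((h for h in handlers if h.check(job, succeeded)), None)
            if handler is None:
                break  # no error detected, move on to the next job
            if total >= max_errors:
                raise RuntimeError("maximum number of errors reached")
            job_errors.append(handler.correct(job))
            total += 1
        all_errors.append(job_errors)
    return all_errors


jobs = [ToyJob("job1"), ToyJob("job2")]
errors = run_all(jobs, [ToyHandler()])
```

Each toy job fails once, is corrected, and succeeds on its second attempt, so `errors` holds one correction dict per job.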

Initializes a Custodian from a list of jobs and error handlers.

Parameters
- handlers ( [ErrorHandler]) – Error handlers. In order of priority of fixing.
- jobs ( [Job]) – Sequence of Jobs to be run. Note that this can be any sequence or even a generator yielding jobs.
- validators ( [Validator]) – Validators to ensure job success
- max_errors_per_job (int) – Maximum number of errors per job allowed before exiting. Defaults to None, which means it is set equal to max_errors.
- max_errors (int) – Maximum number of total errors allowed before exiting. Defaults to 1.
- polling_time_step (int) – The length of time in seconds between steps in which a job is checked for completion. Defaults to 10 secs.
- monitor_freq (int) – The number of polling steps before monitoring occurs. For example, if you have a polling_time_step of 10 seconds and a monitor_freq of 30, this means that Custodian uses the monitors to check for errors every 30 x 10 = 300 seconds, i.e., 5 minutes.
- skip_over_errors (bool) – If set to True, custodian will skip over error handlers that failed (raised an Exception of some sort). Otherwise, custodian will simply exit on unrecoverable errors. The former will lead to potentially more robust performance, but may make it difficult to improve handlers. The latter will allow one to catch potentially bad error handler implementations. Defaults to False.
- scratch_dir (str) – If this is set, any files in the current directory are copied to a temporary directory in a scratch space first before any jobs are performed, and moved back to the current directory upon completion of all jobs. This is useful in some setups where a scratch partition has much faster IO. To use this, set scratch_dir=root of directory you want to use for runs. There is no need to provide unique directory names; we will use python’s tempfile creation mechanisms. A symbolic link is created during the course of the run in the working directory called “scratch_link” as users may want to sometimes check the output during the course of a run. If this is None (the default), the run is performed in the current working directory.
- gzipped_output (bool) – Whether to gzip the final output to save space. Defaults to False.
- checkpoint (bool) – Whether to checkpoint after each successful Job. Checkpoints are stored as custodian.chk.#.tar.gz files. Defaults to False.
- terminate_func (callable) – A function to be called to terminate a running job. If None, the default is to call Popen.terminate.
- terminate_on_nonzero_returncode (bool) – If True, a non-zero return code on any Job will result in a termination. Defaults to True.
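
The timing arithmetic given for monitor_freq and the fallback of max_errors_per_job to max_errors can be sketched as below. Both helper functions are hypothetical and exist only for illustration; they are not part of custodian.

```python
def monitor_interval_seconds(polling_time_step=10, monitor_freq=30):
    """Seconds between monitor checks: one check every monitor_freq polls."""
    return polling_time_step * monitor_freq


def effective_max_errors_per_job(max_errors_per_job=None, max_errors=1):
    """None means the per-job limit falls back to the global limit."""
    return max_errors if max_errors_per_job is None else max_errors_per_job


interval = monitor_interval_seconds()
print(f"monitors run every {interval} s ({interval // 60} min)")
print(effective_max_errors_per_job(None, max_errors=5))
```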

LOG_FILE = 'custodian.json'

classmethod from_spec(spec)

Load a Custodian instance where the jobs are specified from a structure and a spec dict. This allows simple custom job sequences to be constructed quickly via a YAML file.

Parameters

spec (dict) – A dict specifying the jobs. A sample of the dict in YAML format for the usual MP workflow is given as follows:
```
jobs:
- jb: custodian.vasp.jobs.VaspJob
  params:
    final: False
    suffix: .relax1
- jb: custodian.vasp.jobs.VaspJob
  params:
    final: True
    suffix: .relax2
    settings_override: {"file": "CONTCAR", "action": {"_file_copy": {"dest": "POSCAR"}}}
jobs_common_params:
  vasp_cmd: /opt/vasp
handlers:
- hdlr: custodian.vasp.handlers.VaspErrorHandler
- hdlr: custodian.vasp.handlers.AliasingErrorHandler
- hdlr: custodian.vasp.handlers.MeshSymmetryHandler
validators:
- vldr: custodian.vasp.validators.VasprunXMLValidator
custodian_params:
  scratch_dir: /tmp
```
The jobs key is a list of jobs. Each job is specified via “jb”: followed by the dotted path of the job class, as in the sample, and all of its parameters are specified via params, which is a dict.

jobs_common_params specifies a common set of parameters that are passed to all jobs, e.g., vasp_cmd.
Returns

Custodian instance.
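
A spec of this shape names each class by its dotted path, so building the objects comes down to importing the class and calling it with the merged parameters. The sketch below shows that idea only and is not from_spec itself: `load_class` is a hypothetical helper, standard-library classes stand in for the custodian ones so the code runs without custodian installed, and the merge order of common and per-job parameters is an assumption.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in spec with the same keys as the YAML sample ("jb", "params",
# "jobs_common_params"); collections.OrderedDict replaces the job class.
spec = {
    "jobs": [
        {"jb": "collections.OrderedDict", "params": {"suffix": ".relax1"}},
        {"jb": "collections.OrderedDict", "params": {"suffix": ".relax2"}},
    ],
    "jobs_common_params": {"vasp_cmd": "/opt/vasp"},
}

jobs = []
for entry in spec["jobs"]:
    cls = load_class(entry["jb"])
    # Assumed merge: common parameters first, then the job's own params.
    params = {**spec.get("jobs_common_params", {}), **entry.get("params", {})}
    jobs.append(cls(**params))
```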

run()

Runs all jobs.

Returns

All errors encountered, as a list of lists: [[error_dicts for job 1], [error_dicts for job 2], …]
Raises
- ValidationError – if a job fails validation
- ReturnCodeError – if the process has a return code different from 0
- NonRecoverableError – if an unrecoverable error occurs
- MaxCorrectionsPerJobError – if max_errors_per_job is reached
- MaxCorrectionsError – if max_errors is reached
- MaxCorrectionsPerHandlerError – if max_errors_per_handler is reached

run_interrupted()

Runs custodian in an interrupted mode, which sets up and validates jobs but doesn’t run the executable.

Returns

Number of remaining jobs.
Raises
- ValidationError – if a job fails validation
- ReturnCodeError – if the process has a return code different from 0
- NonRecoverableError – if an unrecoverable error occurs
- MaxCorrectionsPerJobError – if max_errors_per_job is reached
- MaxCorrectionsError – if max_errors is reached
- MaxCorrectionsPerHandlerError – if max_errors_per_handler is reached

exception custodian.custodian.CustodianError(message, raises=False)

Bases: RuntimeError

Exception class for Custodian errors.

Initializes the error with a message.

Parameters
- message (str) – Message passed to Exception
- raises (bool) – Whether this should be raised outside custodian

class custodian.custodian.ErrorHandler()

Bases: MSONable

Abstract base class defining the interface for an ErrorHandler.

abstract check()

This method is called during the job (for monitors) or at the end of the job to check for errors.

Returns

(bool) Indicating if errors are detected.

abstract correct()

This method is called at the end of a job when an error is detected. It should perform any corrective measures relating to the detected error.

Returns

(dict) JSON serializable dict that describes the errors and actions taken. E.g. {“errors”: list_of_errors, “actions”: list_of_actions_taken}. If this is an unfixable error, actions should be set to None.
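
A minimal sketch of the check()/correct() contract follows. `LogStringHandler` is a hypothetical class; in real use it would subclass custodian.custodian.ErrorHandler, and the base class is omitted here only so the sketch runs without custodian installed. The "fix" it applies (moving the log aside) is purely illustrative.

```python
import os
import tempfile


class LogStringHandler:
    """Hypothetical handler: flags an error if a marker appears in a log."""

    is_monitor = False  # checked only at the end of the job

    def __init__(self, log_path, marker="ERROR"):
        self.log_path = log_path
        self.marker = marker

    def check(self):
        # Return True when an error is detected.
        if not os.path.exists(self.log_path):
            return False
        with open(self.log_path) as f:
            return self.marker in f.read()

    def correct(self):
        # Illustrative fix: move the log aside so a rerun starts clean.
        backup = self.log_path + ".bak"
        os.replace(self.log_path, backup)
        return {
            "errors": [self.marker],
            "actions": [{"moved": os.path.basename(backup)}],
        }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "run.log")
    with open(path, "w") as f:
        f.write("step 1 ok\nERROR: did not converge\n")
    handler = LogStringHandler(path)
    detected = handler.check()        # error marker is present
    report = handler.correct()        # JSON-serializable description
    detected_after = handler.check()  # log was moved aside
```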

is_monitor = False

This class property indicates whether the error handler is a monitor, i.e., a handler that monitors a job as it is running. If a monitor-type handler notices an error, the job will be sent a termination signal, the error is then corrected, and then the job is restarted. This is useful for catching errors that occur early in the run but do not cause immediate failure.

is_terminating = True

Whether this handler terminates a job upon error detection. By default, this is True, which means that the current Job will be terminated upon error detection, corrections applied, and restarted. In some instances, some errors may not need the job to be terminated or may need to wait for some other event to terminate a job. For example, a particular error may require a flag to be set to request a job to terminate gracefully once it finishes its current task. The handler to set the flag should be classified as is_terminating = False to not terminate the job.

max_num_corrections = None

property n_applied_corrections()

The number of times the handler has given a correction and this has been applied.

Returns

the number of corrections applied.
Return type

(int)

raise_on_max = False

Whether corrections from this specific handler should be applied only a fixed maximum number of times on a single job (i.e., the counter is reset at the beginning of each job). If the maximum number is reached, the code will either raise a MaxCorrectionsPerHandlerError (raise_on_max == True) or stop considering the correction (raise_on_max == False). If max_num_corrections is None, this option is not considered. These options can be overridden as class attributes of the subclass or as customizable options set as instance attributes in __init__.
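
The interaction between max_num_corrections and raise_on_max can be sketched as a small decision function. This is an illustration of the rule as described here, not custodian's code: `may_apply_correction` is a hypothetical helper and `MaxCorrectionsReached` is a stand-in for MaxCorrectionsPerHandlerError.

```python
class MaxCorrectionsReached(Exception):
    """Stand-in for MaxCorrectionsPerHandlerError (hypothetical)."""


def may_apply_correction(n_applied, max_num_corrections=None, raise_on_max=False):
    """Return True if the handler may apply another correction on this job."""
    if max_num_corrections is None:
        return True  # no cap configured, raise_on_max is ignored
    if n_applied < max_num_corrections:
        return True
    if raise_on_max:
        raise MaxCorrectionsReached(
            f"handler already applied {n_applied} corrections"
        )
    return False  # cap reached: the handler is skipped from now on


print(may_apply_correction(10))                         # no cap
print(may_apply_correction(1, max_num_corrections=2))   # below the cap
print(may_apply_correction(2, max_num_corrections=2))   # cap reached
```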

raises_runtime_error = True

Whether this handler causes custodian to raise a runtime error if it cannot handle the error (i.e. if correct returns a dict with “actions”:None, or “actions”:[])

class custodian.custodian.Job()

Bases: MSONable

Abstract base class defining the interface for a Job.

property name()

A nice string name for the job.

abstract postprocess()

This method is called at the end of a job, after error detection. This allows post-processing, such as cleanup, analysis of results, etc.

abstract run()

This method performs the actual work for the job. If parallel error checking (monitoring) is desired, this must return a Popen process.

abstract setup()

This method is run before the start of a job. Allows for some pre-processing.

terminate()

Implement termination function.
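
The setup/run/postprocess lifecycle, with run() returning a Popen process so that the job could be polled while it runs, might look like the sketch below. `EchoJob` is hypothetical; a real job would subclass custodian.custodian.Job, and the base class is left out so the example runs without custodian installed. The calls are made by hand here, whereas Custodian would make them itself.

```python
import subprocess
import sys


class EchoJob:
    """Hypothetical job that prints a message in a child process."""

    def __init__(self, message):
        self.message = message
        self.log = []

    @property
    def name(self):
        return f"EchoJob({self.message})"

    def setup(self):
        # Pre-processing before the job starts (e.g. writing input files).
        self.log.append("setup")

    def run(self):
        # Returning the Popen object lets the caller poll the process.
        self.log.append("run")
        return subprocess.Popen(
            [sys.executable, "-c", f"print({self.message!r})"],
            stdout=subprocess.PIPE,
            text=True,
        )

    def postprocess(self):
        # Cleanup or analysis after the job and its error checks finish.
        self.log.append("postprocess")


job = EchoJob("hello")
job.setup()
process = job.run()
output, _ = process.communicate()  # wait for the child to finish
job.postprocess()
```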

exception custodian.custodian.MaxCorrectionsError(message, raises, max_errors)

Bases: CustodianError

Error raised when the maximum allowed number of errors is reached

Parameters
- message (str) – Message passed to Exception
- raises (bool) – Whether this should be raised outside custodian
- max_errors (int) – the number of errors reached

exception custodian.custodian.MaxCorrectionsPerHandlerError(message, raises, max_errors_per_handler, handler)

Bases: CustodianError

Error raised when the maximum allowed number of errors per handler is reached

Parameters
- message (str) – Message passed to Exception
- raises (bool) – Whether this should be raised outside custodian
- max_errors_per_handler (int) – the number of errors per handler reached
- handler (Handler) – the handler that caused the exception

exception custodian.custodian.MaxCorrectionsPerJobError(message, raises, max_errors_per_job, job)

Bases: CustodianError

Error raised when the maximum allowed number of errors per job is reached

Parameters
- message (str) – Message passed to Exception
- raises (bool) – Whether this should be raised outside custodian
- max_errors_per_job (int) – the number of errors per job reached
- job (Job) – the job that was stopped

exception custodian.custodian.NonRecoverableError(message, raises, handler)

Bases: CustodianError

Error raised when a handler found an error but could not fix it

Parameters
- message (str) – Message passed to Exception
- raises (bool) – Whether this should be raised outside custodian
- handler (Handler) – Handler that caused the exception.

exception custodian.custodian.ReturnCodeError(message, raises=False)

Bases: CustodianError

Error raised when the process gave a non-zero return code

Initializes the error with a message.

Parameters
- message (str) – Message passed to Exception
- raises (bool) – Whether this should be raised outside custodian

exception custodian.custodian.ValidationError(message, raises, validator)

Bases: CustodianError

Error raised when a validator does not pass the check

Parameters
- message (str) – Message passed to Exception
- raises (bool) – Whether this should be raised outside custodian
- validator (Validator) – Validator that caused the exception.

class custodian.custodian.Validator()

Bases: MSONable

Abstract base class defining the interface for a Validator. A Validator differs from an ErrorHandler in that it does not correct a run and is run only at the end of a Job. If errors are detected by a Validator, a run is immediately terminated.

abstract check()

This method is called at the end of a job.

Returns

(bool) Indicating if errors are detected.
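
As a sketch of the Validator contract, the class below reports an error when an expected output file is missing or empty. `OutputExistsValidator` is a hypothetical name; a real validator would subclass custodian.custodian.Validator, and the base class is omitted so the example runs without custodian installed. Note that check() returns True when something is wrong.

```python
import os
import tempfile


class OutputExistsValidator:
    """Hypothetical validator: the output file must exist and be non-empty."""

    def __init__(self, path):
        self.path = path

    def check(self):
        # True signals an error, matching the ErrorHandler convention.
        ok = os.path.isfile(self.path) and os.path.getsize(self.path) > 0
        return not ok


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "result.out")
    validator = OutputExistsValidator(out)
    missing_is_error = validator.check()  # file does not exist yet

    with open(out, "w") as f:
        f.write("done\n")
    present_is_error = validator.check()  # file exists and has content
```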