custodian package


custodian.custodian module

class Custodian(handlers, jobs, validators=None, max_errors=1, polling_time_step=10, monitor_freq=30, skip_over_errors=False, scratch_dir=None, gzipped_output=False, checkpoint=False, terminate_func=None, terminate_on_nonzero_returncode=True)

Bases: object

The Custodian class is the manager for a list of jobs given a list of error handlers. The way it works is as follows:

  1. Let’s say you have defined a list of jobs as [job1, job2, job3, ...] and you have defined a list of possible error handlers as [err1, err2, ...]
  2. Custodian will run the jobs in the order of job1, job2, ... During each job, custodian will monitor for errors using the handlers that have is_monitor == True. If an error is detected, corrective measures are taken and the particular job is rerun.
  3. At the end of each individual job, Custodian will run through the list of error handlers that have is_monitor == False. If an error is detected, corrective measures are taken and the particular job is rerun.
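
The control flow described above can be sketched in plain Python. This is a simplified illustration and not Custodian's actual implementation: run_jobs, ToyJob and ToyHandler are hypothetical names, monitoring of a running job is left out, RuntimeError stands in for the library's own exception, and only the end-of-job check, correct, rerun cycle is shown.

```python
def run_jobs(jobs, handlers, max_errors=1):
    """Simplified sketch: run each job, check handlers afterwards, rerun on error."""
    all_errors = []
    total_errors = 0
    for job in jobs:
        job_errors = []
        while True:
            job.run()
            # End-of-job pass: every handler that detects an error applies its correction.
            corrections = [h.correct() for h in handlers if h.check()]
            if not corrections:
                break  # no errors detected, move on to the next job
            job_errors.extend(corrections)
            total_errors += len(corrections)
            if total_errors >= max_errors:
                # Stand-in for the library's own exception type.
                raise RuntimeError("max_errors reached")
        all_errors.append(job_errors)
    return all_errors


class ToyJob:
    """Hypothetical job that only counts how often it has been run."""

    def __init__(self, state):
        self.state = state

    def run(self):
        self.state["runs"] += 1


class ToyHandler:
    """Hypothetical handler: reports an error until the job has run twice."""

    def __init__(self, state):
        self.state = state

    def check(self):
        return self.state["runs"] < 2

    def correct(self):
        return {"errors": ["too few runs"], "actions": ["rerun"]}


state = {"runs": 0}
errors = run_jobs([ToyJob(state)], [ToyHandler(state)], max_errors=5)
```

In this toy run the job is executed twice: the first pass triggers one correction, the second pass is clean, so errors holds one list containing a single error dict.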

Initializes a Custodian from a list of jobs and error handlers.

  • handlers ([ErrorHandler]) – Error handlers. In order of priority of fixing.
  • jobs ([Job]) – Sequence of Jobs to be run. Note that this can be any sequence or even a generator yielding jobs.
  • validators ([Validator]) – Validators to ensure job success
  • max_errors (int) – Maximum number of errors allowed before exiting. Defaults to 1.
  • polling_time_step (int) – The length of time in seconds between steps in which a job is checked for completion. Defaults to 10 secs.
  • monitor_freq (int) – The number of polling steps before monitoring occurs. For example, if you have a polling_time_step of 10 seconds and a monitor_freq of 30, this means that Custodian uses the monitors to check for errors every 30 x 10 = 300 seconds, i.e., 5 minutes.
  • skip_over_errors (bool) – If set to True, custodian will skip over error handlers that failed (raised an Exception of some sort). Otherwise, custodian will simply exit on unrecoverable errors. The former will lead to potentially more robust performance, but may make it difficult to improve handlers. The latter will allow one to catch potentially bad error handler implementations. Defaults to False.
  • scratch_dir (str) – If this is set, any files in the current directory are copied to a temporary directory in a scratch space first before any jobs are performed, and moved back to the current directory upon completion of all jobs. This is useful in some setups where a scratch partition has much faster IO. To use this, set scratch_dir=root of directory you want to use for runs. There is no need to provide unique directory names; we will use python’s tempfile creation mechanisms. A symbolic link is created during the course of the run in the working directory called “scratch_link” as users may want to sometimes check the output during the course of a run. If this is None (the default), the run is performed in the current working directory.
  • gzipped_output (bool) – Whether to gzip the final output to save space. Defaults to False.
  • checkpoint (bool) – Whether to checkpoint after each successful Job. Checkpoints are stored as custodian.chk.#.tar.gz files. Defaults to False.
  • terminate_func (callable) – A function to be called to terminate a running job. If None, the default is to call Popen.terminate.
  • terminate_on_nonzero_returncode (bool) – If True, a non-zero return code on any Job will result in a termination. Defaults to True.
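
As a quick check of how polling_time_step and monitor_freq combine, the following sketch restates the arithmetic from the monitor_freq description. The helper names monitor_interval and monitor_steps are hypothetical and not part of the custodian API.

```python
def monitor_interval(polling_time_step=10, monitor_freq=30):
    """Seconds between monitor checks: one check every `monitor_freq` polling steps."""
    return polling_time_step * monitor_freq


def monitor_steps(n_polls, monitor_freq=30):
    """1-based polling steps at which monitors would run (every monitor_freq-th step)."""
    return [step for step in range(1, n_polls + 1) if step % monitor_freq == 0]


interval = monitor_interval()   # 10 s x 30 steps = 300 s
steps = monitor_steps(100)      # steps 30, 60 and 90 out of the first 100 polls
```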
LOG_FILE = 'custodian.json'
classmethod from_spec(spec)

Load a Custodian instance where the jobs are specified from a structure and a spec dict. This allows simple custom job sequences to be constructed quickly via a YAML file.

Parameters:spec (dict) – A dict specifying the job sequence. A sample of the dict in YAML format for the usual MP workflow is given as follows:

```
jobs:
- jb:
  params:
    final: False
    suffix: .relax1
- jb:
  params:
    final: True
    suffix: .relax2
    settings_override: {"file": "CONTCAR", "action": {"_file_copy": {"dest": "POSCAR"}}}
jobs_common_params:
  vasp_cmd: /opt/vasp
handlers:
- hdlr: custodian.vasp.handlers.VaspErrorHandler
- hdlr: custodian.vasp.handlers.AliasingErrorHandler
- hdlr: custodian.vasp.handlers.MeshSymmetryHandler
validators:
- vldr: custodian.vasp.validators.VasprunXMLValidator
custodian_params:
  scratch_dir: /tmp
```


The jobs key is a list of jobs. Each job is specified via "jb": <explicit path>, and all of its parameters are specified via params, which is a dict.

jobs_common_params specifies a common set of parameters that are passed to all jobs, e.g., vasp_cmd.

Returns:Custodian instance.
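
To make the role of the dotted paths concrete, here is a minimal sketch of how a string such as a hdlr or vldr entry could be resolved to a class and instantiated with its params. The helper names load_class and build are hypothetical, and a standard-library class stands in for a real job; this is an illustration of the idea, not necessarily how from_spec is implemented.

```python
import importlib


def load_class(dotpath):
    """Resolve 'package.module.ClassName' to the class object."""
    module_name, _, class_name = dotpath.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


def build(entries, key, common_params=None):
    """Instantiate every entry of a spec list, merging in shared parameters."""
    objects = []
    for entry in entries:
        params = dict(entry.get("params", {}))
        params.update(common_params or {})
        objects.append(load_class(entry[key])(**params))
    return objects


# Stand-in spec: fractions.Fraction plays the role of a job class here.
spec = {"jobs": [{"jb": "fractions.Fraction", "params": {"numerator": 3}}]}
jobs = build(spec["jobs"], "jb", common_params={"denominator": 4})
```

The same build helper could be pointed at the handlers and validators lists by passing "hdlr" or "vldr" as the key.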

run()

Runs all the jobs.

Returns:All errors encountered, as a list of lists: [[error_dicts for job 1], [error_dicts for job 2], ...]

run_interrupted()

Runs custodian in an interrupted mode, which sets up and validates jobs but doesn’t run the executable.

Returns:Number of remaining jobs.

Raises:CustodianError on unrecoverable errors, and on jobs that fail validation.

exception CustodianError(message, raises=False, validator=None)

Bases: Exception

Exception class for Custodian errors.

Initializes the error with a message.

  • message (str) – Message passed to Exception
  • raises (bool) – Whether this should raise a runtime error when caught
  • validator (Validator/ErrorHandler) – Validator or ErrorHandler that caused the exception.
class ErrorHandler

Bases: monty.json.MSONable

Abstract base class defining the interface for an ErrorHandler.


check()

This method is called during the job (for monitors) or at the end of the job to check for errors.

Returns:(bool) Indicating if errors are detected.

correct()

This method is called at the end of a job when an error is detected. It should perform any corrective measures relating to the detected error.

Returns:(dict) JSON serializable dict that describes the errors and actions taken, e.g., {"errors": list_of_errors, "actions": list_of_actions_taken}. If this is an unfixable error, actions should be set to None.
is_monitor = False

This class property indicates whether the error handler is a monitor, i.e., a handler that monitors a job as it is running. If a monitor-type handler notices an error, the job will be sent a termination signal, the error is then corrected, and then the job is restarted. This is useful for catching errors that occur early in the run but do not cause immediate failure.

is_terminating = True

Whether this handler terminates a job upon error detection. By default, this is True, which means that the current Job will be terminated upon error detection, corrections applied, and restarted. In some instances, some errors may not need the job to be terminated or may need to wait for some other event to terminate a job. For example, a particular error may require a flag to be set to request a job to terminate gracefully once it finishes its current task. The handler to set the flag should be classified as is_terminating = False to not terminate the job.

raises_runtime_error = True

Whether this handler causes custodian to raise a runtime error if it cannot handle the error (i.e., if correct returns a dict with "actions": None or "actions": []).
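
The interface above can be illustrated with a small end-of-job handler. This is a sketch under stated assumptions: MaxIterHandler, the file names job.log and input.json, the "NOT CONVERGED" marker and the max_iter setting are all hypothetical, and the class is duck-typed so that it runs without custodian installed; a real handler would subclass ErrorHandler.

```python
import json
import os
import tempfile


class MaxIterHandler:
    """Hypothetical handler: doubles max_iter when the log reports non-convergence."""

    is_monitor = False           # only checked after the job has finished
    is_terminating = True
    raises_runtime_error = True

    def __init__(self, log_file="job.log", input_file="input.json"):
        self.log_file = log_file
        self.input_file = input_file

    def check(self):
        # True means an error was detected.
        if not os.path.exists(self.log_file):
            return False
        with open(self.log_file) as f:
            return "NOT CONVERGED" in f.read()

    def correct(self):
        # Apply the fix to the input file and describe what was done.
        with open(self.input_file) as f:
            settings = json.load(f)
        settings["max_iter"] = settings.get("max_iter", 100) * 2
        with open(self.input_file, "w") as f:
            json.dump(settings, f)
        return {"errors": ["not_converged"],
                "actions": [{"max_iter": settings["max_iter"]}]}


# Usage on throwaway files in a temporary directory.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "job.log")
input_path = os.path.join(workdir, "input.json")
with open(log_path, "w") as f:
    f.write("step 100: NOT CONVERGED\n")
with open(input_path, "w") as f:
    json.dump({"max_iter": 100}, f)

handler = MaxIterHandler(log_path, input_path)
result = handler.correct() if handler.check() else None
```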

class Job

Bases: monty.json.MSONable

Abstract base class defining the interface for a Job.


name

A nice string name for the job.

postprocess()

This method is called at the end of a job, after error detection. This allows post-processing, such as cleanup, analysis of results, etc.

run()

This method performs the actual work for the job. If parallel error checking (monitoring) is desired, this must return a Popen process.

setup()

This method is run before the start of a job. Allows for some pre-processing.
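
A minimal sketch of a job following this interface is shown below. EchoJob is a hypothetical class that launches a short Python subprocess; it is duck-typed so that it runs without custodian installed, whereas a real job would subclass Job. The log attribute exists only to make the call order visible.

```python
import subprocess
import sys


class EchoJob:
    """Hypothetical job that prints a message from a child process."""

    def __init__(self, message):
        self.message = message
        self.log = []

    @property
    def name(self):
        return f"EchoJob({self.message})"

    def setup(self):
        # Pre-processing before the run, e.g. writing input files.
        self.log.append("setup")

    def run(self):
        self.log.append("run")
        # Returning the Popen object is what makes monitoring during the run possible.
        return subprocess.Popen(
            [sys.executable, "-c", f"print({self.message!r})"],
            stdout=subprocess.PIPE,
            text=True,
        )

    def postprocess(self):
        # Cleanup or analysis after error detection has finished.
        self.log.append("postprocess")


job = EchoJob("hello")
job.setup()
process = job.run()
output, _ = process.communicate()
job.postprocess()
```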

class Validator

Bases: monty.json.MSONable

Abstract base class defining the interface for a Validator. A Validator differs from an ErrorHandler in that it does not correct a run and is run only at the end of a Job. If errors are detected by a Validator, a run is immediately terminated.


check()

This method is called at the end of a job.

Returns:(bool) Indicating if errors are detected.
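
As a sketch of the difference from an ErrorHandler, the validator below only reports a problem and offers no correct method. OutputExistsValidator and the file name result.out are hypothetical, and the class is duck-typed so that it runs without custodian installed; a real validator would subclass Validator.

```python
import os
import tempfile


class OutputExistsValidator:
    """Hypothetical validator: flags a job whose output file is missing or empty."""

    def __init__(self, output_file):
        self.output_file = output_file

    def check(self):
        # True means an error was detected, as with ErrorHandler.check.
        return not (
            os.path.isfile(self.output_file)
            and os.path.getsize(self.output_file) > 0
        )


workdir = tempfile.mkdtemp()
validator = OutputExistsValidator(os.path.join(workdir, "result.out"))
missing_is_error = validator.check()        # no output file written yet

with open(validator.output_file, "w") as f:
    f.write("done\n")
present_is_error = validator.check()        # output file now exists and is non-empty
```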

custodian.utils module

backup(filenames, prefix='error')

Backup files to a tar.gz file. Used, for example, in backing up the files of an errored run before performing corrections.

  • filenames ([str]) – List of files to back up. Supports wildcards, e.g., *.*.
  • prefix (str) – Prefix for the archive names. Defaults to error, which means a series of error.1.tar.gz, error.2.tar.gz, ... will be generated.
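
The numbering scheme described above can be sketched with the standard library. This is an illustrative re-implementation, not the library's code: backup_sketch is a hypothetical name, and the directory argument is an assumption added so that the example can run in a temporary directory.

```python
import glob
import os
import tarfile
import tempfile


def backup_sketch(filenames, prefix="error", directory="."):
    """Archive matching files into the next free <prefix>.<n>.tar.gz in `directory`."""
    existing = glob.glob(os.path.join(directory, f"{prefix}.*.tar.gz"))
    indices = []
    for path in existing:
        token = os.path.basename(path)[len(prefix) + 1:-len(".tar.gz")]
        if token.isdigit():
            indices.append(int(token))
    next_index = max([0] + indices) + 1

    # Expand wildcards before the new archive exists, so it cannot match itself.
    paths = [p for pattern in filenames
             for p in glob.glob(os.path.join(directory, pattern))]

    archive = os.path.join(directory, f"{prefix}.{next_index}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
    return archive


workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "run.txt"), "w") as f:
    f.write("failed run output\n")

first = backup_sketch(["*.txt"], directory=workdir)
second = backup_sketch(["*.txt"], directory=workdir)
```

Calling the helper twice produces error.1.tar.gz and then error.2.tar.gz, each containing run.txt.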

get_execution_host_info()

Tries to return a tuple describing the execution host. Doesn’t work for all queueing systems.


Module contents