Dealing with Failures and Crashes

Warning

This documentation is written as a tutorial that starts from a clean FireWorks database. If you have an existing FireWorks database and need to deal with a failure or crash, do not reset your FireWorks database as the tutorial instructs. Instead, first read through this tutorial to get an idea on how failures work in FireWorks. Then, refer to the database maintenance instructions.

Job exceptions, node failures, and system outages are all unfortunate realities of executing workflows. You’ll likely encounter some of these events when running FireWorks. This tutorial simulates some of them, so you can see how FireWorks detects job failures and what you can do about them.

Normal operation

Let’s first introduce normal operation of a Firework that prints starting, sleeps for 10 seconds, and then prints ending. The Firework is completed successfully only if ending gets printed.

  1. Move to the failures tutorial directory in your installation directory:

    cd <INSTALL_DIR>/fw_tutorials/failures
    
  2. Look inside fw_sleep.yaml. It should be pretty straightforward - we are printing text, sleeping, and printing text again.

    Note

    You can increase or decrease the sleep time, depending on your patience level and reaction time later on in the tutorial.

  3. Let’s add and run this Firework. You’ll have to wait 10 seconds for it to complete:

    lpad reset
    lpad add fw_sleep.yaml
    rlaunch singleshot
    
  4. Hopefully, your patience was rewarded with ending printed to your terminal. If so, let’s keep going!

Error during run - a FIZZLED Firework!

If your job throws an exception (error), FireWorks will automatically mark your job as FIZZLED. Any jobs that depend on this job will not run until you fix things. Let’s simulate this situation.

  1. Reset your database and add back the sleeping Firework:

    lpad reset
    lpad add fw_sleep.yaml
    
  2. We’ll run the Firework again, but this time interrupt it with the keyboard shortcut that stops execution (Ctrl+C). Press that key combination immediately after starting the job, before you see the text ending:

    rlaunch singleshot
    (Ctrl+C)
    
  3. If you did this correctly, you’ll have seen the text starting but not the text ending. You might also see some error text printed to your terminal.

  4. This behavior is what happens when your job throws an error (such as the KeyboardInterrupt error we just simulated). Let’s see what became of this ill-fated Firework:

    lpad get_fws -i 1 -d all
    
  5. You should notice the state of this Firework is automatically marked as FIZZLED. In addition, if you look at the stored_data key, you’ll see that there’s information about the error that was encountered during the run. If you’re thorough, you’ll see something about a KeyboardInterrupt.

    Note

    If the exception thrown by the job implements the to_dict() method, that method will be called to serialize customized information about the exception, and the result will be added to the stored_data.

  6. If at any point you want to review which Fireworks have FIZZLED, you can use the following query:

    lpad get_fws -s FIZZLED -d ids
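
The behavior described in this section (catch the error, mark the job as FIZZLED, store details about the exception) can be illustrated with a small standalone sketch. It does not use FireWorks itself: `run_job`, `ConvergenceError`, and the key names inside `stored_data` are assumptions chosen for illustration, not the library's actual code.

```python
import traceback


class ConvergenceError(Exception):
    """Hypothetical job error that carries extra information."""

    def __init__(self, message, last_step):
        super().__init__(message)
        self.last_step = last_step

    def to_dict(self):
        # Customized, serializable details about the failure.
        return {"message": str(self), "last_step": self.last_step}


def run_job(job):
    """Run a callable and return a (state, stored_data) pair."""
    try:
        job()
    except BaseException as exc:  # BaseException also covers KeyboardInterrupt
        # Key names below are illustrative assumptions.
        stored_data = {"_exception": {"_stacktrace": traceback.format_exc()}}
        if hasattr(exc, "to_dict"):
            stored_data["_exception"]["_details"] = exc.to_dict()
        return "FIZZLED", stored_data
    return "COMPLETED", {}


def failing_job():
    raise ConvergenceError("did not converge", last_step=42)


state, stored_data = run_job(failing_job)
print(state)
print(stored_data["_exception"]["_details"])
```

Note that Ctrl+C raises KeyboardInterrupt, which derives from BaseException rather than Exception, so a wrapper that only catches Exception would miss it.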
    

Catastrophic Failure

The previous failure was easy to detect; the job threw an error, and the Rocket was able to catch that error and tell the LaunchPad to mark the job as FIZZLED. However, more catastrophic failures are possible. For example, you might have a power failure in your computer center. In that case, there is no time for the Rocket to report to FireWorks that there is a failure. Let’s see how to handle this case.

  1. Reset your database and add back the sleeping Firework:

    lpad reset
    lpad add fw_sleep.yaml
    
  2. We’ll run the Firework again, but this time you should interrupt its operation by forcibly closing your terminal window (immediately after running the job, before you see the text ending):

    rlaunch singleshot
    ----(forcibly close your terminal window)
    
  3. Now let’s re-open a terminal window and see what FireWorks thinks is happening with this job:

    lpad get_fws -i 1 -d more
    
  4. You should notice that FireWorks still thinks this job is RUNNING! We can fix this using the following command:

    lpad detect_lostruns --time 1 --fizzle
    

    Note

    Instead of --fizzle, you could use --rerun. This would mark the Launch as FIZZLED and then rerun the Firework.

    Note

    An additional constraint, --max_runtime, can be used if you are looking for jobs that ran only a short time before failing. This can help you track down jobs that were killed because they did not have enough walltime to run (for example, if they started in the middle of the queue job). Because runtime is determined using pings, set this parameter slightly larger than a multiple of the ping_time.

  5. The detect_lostruns command above will mark all jobs that have been running for more than 1 second as FIZZLED. We’ll improve this in a bit, but for now let’s check to make sure the command worked:

    lpad get_fws -i 1 -d more
    
  6. The Firework should now be correctly listed as FIZZLED!

  7. Of course, in production you’ll never want to mark all jobs running for 1 second as being FIZZLED; this will mark jobs that are running properly as FIZZLED!

  8. In production, you need not specify the --time parameter at all. When you run lpad detect_lostruns, FireWorks will automatically detect a job as FIZZLED after 4 hours of idle time. Jobs that are running properly, even if they take longer than 4 hours, will not be marked as FIZZLED, because the Rocket automatically pings the LaunchPad every hour to report that it’s alive. FireWorks will only mark a job as FIZZLED when it has not received this ping from the Rocket for 4 hours. You can test this feature with the following sequence of commands:

    lpad reset
    lpad add fw_sleep.yaml
    rlaunch singleshot
    ---(forcibly close your terminal window)
    ---(wait 4 or more hours!! or temporarily set your System Clock ahead by 5 hours)
    lpad detect_lostruns --fizzle
    lpad get_fws -i 1 -d all
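
The ping-based rule described in step 8 can be sketched in a few lines. This is only an illustration of the idea under stated assumptions: the launch records are plain dictionaries, and `find_lost_runs` and `RUN_EXPIRATION` are hypothetical names, whereas FireWorks itself works against its database.

```python
from datetime import datetime, timedelta

# Idle time after which a run is considered lost (4 hours, as in the text).
RUN_EXPIRATION = timedelta(hours=4)


def find_lost_runs(launches, now, expiration=RUN_EXPIRATION):
    """Return ids of RUNNING launches whose last ping is older than `expiration`."""
    return [
        launch["id"]
        for launch in launches
        if launch["state"] == "RUNNING" and now - launch["last_ping"] > expiration
    ]


now = datetime(2024, 1, 1, 12, 0)
launches = [
    # A long job that pinged 30 minutes ago: still alive.
    {"id": 1, "state": "RUNNING", "last_ping": now - timedelta(minutes=30)},
    # No ping for 5 hours: the process has probably died.
    {"id": 2, "state": "RUNNING", "last_ping": now - timedelta(hours=5)},
    # Already finished, so it is never a candidate.
    {"id": 3, "state": "COMPLETED", "last_ping": now - timedelta(hours=9)},
]
print(find_lost_runs(launches, now))  # [2]
```

The check looks at the time since the last ping, not the total runtime, which is why a healthy job that runs for days is never flagged. Passing a much smaller `expiration`, as --time 1 does in the tutorial, would flag healthy jobs as well.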
    

Note

You can shorten the ping times and detection times by editing the settings in your FW configuration, but we suggest you leave them alone unless really needed.

Note

In production, you can use the database maintenance instructions instead of calling lpad detect_lostruns --fizzle.

Life after FIZZLED

Once FireWorks has identified a job as FIZZLED, you might wonder what comes next. One option is to resubmit your workflow, perhaps with modifications to prevent any problems that might have caused the job failure. If you’ve correctly enabled duplicate checking, your new workflow will automatically pick up where you left off, and you won’t do any extra calculations. This is the preferred way of dealing with failures.

If you haven’t enabled duplicate checking, you can instead rerun your workflow, starting from the failed job. If the EXCEPT_DETAILS_ON_RERUN option is enabled in your FW configuration, the exception details serialized during the last launch will be copied into the spec under the key _exception_details. Customized exceptions can then be implemented to store information that helps properly restart the job. The only caveat to this latter method is that dynamic actions already taken by your workflow will not be reset to their initial state.
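
As a rough sketch of how a job might use _exception_details on a rerun, the function below reads that key from a spec dictionary and adjusts a starting point. The `start_step` and `last_step` fields and the `prepare_restart` helper are hypothetical, and the contents of _exception_details depend entirely on what your own exception's to_dict() returns.

```python
def prepare_restart(spec):
    """Return a copy of `spec` adjusted using details from the previous failure."""
    new_spec = dict(spec)
    details = spec.get("_exception_details")
    if details and "last_step" in details:
        # Resume from the last step reported by the failed launch.
        new_spec["start_step"] = details["last_step"]
    return new_spec


first_run = {"start_step": 0}
rerun = {"start_step": 0, "_exception_details": {"last_step": 42}}

print(prepare_restart(first_run)["start_step"])  # 0
print(prepare_restart(rerun)["start_step"])  # 42
```

On a first run the key is absent, so the spec is returned unchanged.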

Note that you might also find the [Python debugger](https://docs.python.org/3/library/pdb.html) to be a useful tool for inspecting local variables at the point of failure of a given Firework, particularly if your Firework is composed of PyTasks. The debugger can be invoked upon an exception using the --pdb option with rlaunch, e.g.:

rlaunch singleshot -f FW_ID_HERE --pdb
(Pdb) p locals()

You can also continue on with the Workflow even after FIZZLED by setting the _allow_fizzled_parents parameter in your spec. This will allow you to algorithmically fix errors using FireWorks’ dynamic workflow features. This is a fairly advanced use case and will be covered in a future tutorial.
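
The effect of _allow_fizzled_parents can be pictured with a minimal readiness check. This is a conceptual sketch only: `is_ready` is a hypothetical helper, and it assumes a child normally runs only when every parent is COMPLETED.

```python
def is_ready(spec, parent_states):
    """Decide whether a child job may run, given the states of its parents."""
    allowed = {"COMPLETED"}
    if spec.get("_allow_fizzled_parents"):
        # Opt in to running after a failed parent, e.g. to repair its output.
        allowed.add("FIZZLED")
    return all(state in allowed for state in parent_states)


strict_child = {}
recovery_child = {"_allow_fizzled_parents": True}

print(is_ready(strict_child, ["COMPLETED", "FIZZLED"]))  # False
print(is_ready(recovery_child, ["COMPLETED", "FIZZLED"]))  # True
print(is_ready(recovery_child, ["COMPLETED", "RUNNING"]))  # False
```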

Database locks and inconsistencies

When updating the state of a Firework, FireWorks needs to acquire a lock on the database to safely update the state of the whole workflow. Because this procedure may take some time, if many Fireworks belonging to the same Workflow try to update their state simultaneously, the waiting time can easily reach the limit (see WFLOCK_EXPIRATION_SECS in the FW config). If this happens, FireWorks by default will let the job stop, leaving the database in an inconsistent state. As in the previous cases, these jobs can be identified by running lpad detect_lostruns, and the consistency of the database can be restored using the --refresh option. This will refresh the state of the Workflow, applying the correct actions where needed.
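
The lock-with-a-time-limit pattern described here can be sketched with an in-process lock from the Python standard library. This is an analogy under stated assumptions, not how FireWorks implements its database lock: `update_workflow` and `LockTimeout` are hypothetical names, and `timeout_secs` plays the role that WFLOCK_EXPIRATION_SECS plays in the text.

```python
import threading


class LockTimeout(Exception):
    """Raised when the lock cannot be acquired within the time limit."""


def update_workflow(lock, update, timeout_secs):
    """Run `update` while holding `lock`, giving up after `timeout_secs`."""
    if not lock.acquire(timeout=timeout_secs):
        raise LockTimeout(f"could not acquire lock within {timeout_secs} s")
    try:
        return update()
    finally:
        lock.release()


lock = threading.Lock()
print(update_workflow(lock, lambda: "state refreshed", timeout_secs=0.1))

lock.acquire()  # simulate another Firework holding the workflow lock
try:
    update_workflow(lock, lambda: "state refreshed", timeout_secs=0.1)
except LockTimeout as exc:
    print("gave up:", exc)
finally:
    lock.release()
```

In the second call the update never runs, which mirrors the inconsistent state that --refresh is meant to repair.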

Automatically report what parameters cause job failures (beta)

It is one thing to know that many jobs failed via the FIZZLED state, but it is better if one can identify the cause of failure. FireWorks can try to automatically detect which parameters are causing jobs to fail by introspecting the database and compiling a report of which keys in the Firework spec and Workflow metadata are most associated with failed jobs. Thus, if you have a descriptive spec and metadata, they can be used to automatically classify jobs.

The introspect feature is in beta, but you can learn more via:

lpad introspect -h

The introspect command can be very powerful when used in conjunction with the reporting features (lpad report). Contact FireWorks support for more information.
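
To give a feel for the kind of report described above, here is a minimal sketch that ranks spec key/value pairs by how often they appear in failed jobs. It is an assumption-laden illustration rather than the algorithm behind lpad introspect: `failure_report` is a hypothetical helper, the jobs are plain dictionaries, and spec values are assumed to be hashable.

```python
from collections import Counter


def failure_report(jobs):
    """Rank (key, value) pairs from job specs by their fraction of FIZZLED jobs."""
    total = Counter()
    failed = Counter()
    for job in jobs:
        for item in job["spec"].items():
            total[item] += 1
            if job["state"] == "FIZZLED":
                failed[item] += 1
    rows = [(key, value, failed[(key, value)] / count)
            for (key, value), count in total.items()]
    # Highest failure fraction first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


jobs = [
    {"state": "FIZZLED", "spec": {"solver": "fast", "nodes": 4}},
    {"state": "FIZZLED", "spec": {"solver": "fast", "nodes": 8}},
    {"state": "COMPLETED", "spec": {"solver": "safe", "nodes": 4}},
    {"state": "COMPLETED", "spec": {"solver": "safe", "nodes": 8}},
]
for key, value, rate in failure_report(jobs):
    print(f"{key}={value}: {rate:.0%} failed")
```

In this made-up data, every job with solver=fast failed, while the node count shows no clear association with failure.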