Advanced queue submission (reservation mode)
Before we begin - if you’re here and haven’t completed the first tutorial on queue submission, you should go back and complete that first. This tutorial assumes that you already have queue submission working and just need to overcome some of the limitations of simple queue submission.
In this tutorial, we’ll introduce the notion of reserving FireWorks on queue submission. Some differences between the simple method of the previous tutorial and the reservation method are outlined below:
Situation | Simple Queue Launching | Reservation Queue Launching |
---|---|---|
write/submit queue script | write generic script using QueueAdapter file alone | 1. reserve a FW from the database; 2. use FW’s spec to modify queue script |
queue manager runs queue script | determine a FW to run and run it | run the reserved FW |
job is deleted from queue | no action needed by the user | any affected reserved jobs must be unreserved by user using detect_unreserved |
run multiple FWs in one script | supported | currently unsupported |
offline mode | unsupported | supported |
Reserving jobs allows for more flexibility, but also adds maintenance overhead when queues go down or jobs in the queue are cancelled. Hence, there are some advantages to sticking with Simple Queue Launching. With that out of the way, let’s explore the reservation method of queue submission!
Reserving FireWorks
Begin in your working directory from the previous tutorial. You should have four files: fw_test.yaml, my_qadapter.yaml, my_fworker.yaml, and my_launchpad.yaml.
Let’s reset our database and add a Firework for testing:
lpad reset
lpad add fw_test.yaml
Reserving a Firework is as simple as adding the -r option to the Queue Launcher. Let’s queue up a reserved Firework and immediately check its state:
qlaunch -r singleshot
lpad get_fws -i 1 -d all
When you get the Firework, you should notice that its state is RESERVED. No other Rocket Launchers will run that Firework; it is now bound to your queue. Some details of the reservation are given in the launches key of the Firework. In addition, the state_history key should contain the reservation id of your submitted job.
There are a few different commands you can use to leverage the reservation id (often shortened as qid) of your job:
lpad get_qid -i 1 # gets the queue id of FW 1
lpad get_fws --qid 1234 # gets the Firework with queue id 1234
lpad cancel_qid --qid 1234 # cancels reservation 1234. WARNING: the user must remove the job from the queue manually before executing this command.
When your queue runs and completes your job, you should see that the state is updated to COMPLETED:
lpad get_fws -i 1 -d more
Preventing too many jobs in the queue
One nice feature of reserving FireWorks is that you are automatically prevented from submitting more jobs to the queue than there are FireWorks in the database. Let’s try to submit too many jobs and see what happens.
Clean your working directory of everything but four files: fw_test.yaml, my_qadapter.yaml, my_fworker.yaml, and my_launchpad.yaml.
Reset the database and add a Firework for testing:
lpad reset
lpad add fw_test.yaml
We have only one Firework in the database, so we should only be able to submit one job to the queue. Let’s try submitting two:
qlaunch -r singleshot
qlaunch -r singleshot
You should see that the first submission went OK, but the second one told us “No jobs exist in the LaunchPad for submission to queue!” If we repeated this sequence without the -r option, we would submit too many jobs to the queue.
Note
Once the job starts running or completes, both the simple version of the QueueLauncher and the reservation mode will stop you from submitting jobs. However, only the reservation mode will identify that a job is already queued.
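To make the difference concrete, here is a minimal Python sketch of the bookkeeping involved. It is not FireWorks code: the ToyLaunchPad class, the submit function, and the queue lists are hypothetical. The only behavior taken from this tutorial is that reservation mode marks a READY Firework as RESERVED at submission time, while simple mode leaves the Firework READY until a job actually starts.

```python
class ToyLaunchPad:
    """Hypothetical stand-in for the LaunchPad; tracks only Firework states."""

    def __init__(self, n_fireworks):
        # fw_id -> state; every Firework starts out READY
        self.states = {fw_id: "READY" for fw_id in range(1, n_fireworks + 1)}

    def reserve_one(self):
        """Mark the first READY Firework as RESERVED and return its id, or None."""
        for fw_id, state in self.states.items():
            if state == "READY":
                self.states[fw_id] = "RESERVED"
                return fw_id
        return None


def submit(launchpad, queue, reserve):
    """Try to add one job to `queue`; return True if a job was submitted."""
    if reserve:
        fw_id = launchpad.reserve_one()
        if fw_id is None:
            return False  # analogous to "No jobs exist in the LaunchPad ..."
        queue.append(fw_id)  # the job is bound to this Firework
        return True
    # Simple mode: only checks that something is READY, reserves nothing
    if "READY" not in launchpad.states.values():
        return False
    queue.append(None)  # the job will pick a Firework when it starts running
    return True


reserved_queue, simple_queue = [], []
lp_reserved, lp_simple = ToyLaunchPad(1), ToyLaunchPad(1)

reserved_results = [submit(lp_reserved, reserved_queue, reserve=True) for _ in range(2)]
simple_results = [submit(lp_simple, simple_queue, reserve=False) for _ in range(2)]
```

With one Firework in the toy database, the second reserved submission is refused, while simple mode accepts both because the Firework is still READY while the first job waits in the queue.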
Overriding Queue Parameters within the Firework
Another key feature of reserving FireWorks before queue submission is that the Firework can override queue parameters. This is done by specifying the _queueadapter reserved key in the spec. For example, let’s override the walltime parameter.
Clean your working directory of everything but four files: fw_test.yaml, my_qadapter.yaml, my_fworker.yaml, and my_launchpad.yaml.
Look in the file my_qadapter.yaml. You should have a walltime parameter listed, perhaps set to 2 minutes. By default, all jobs submitted by this Queue Launcher would have a 2-minute walltime.
Let’s copy over the fw_walltime.yaml file from the tutorials dir:
cp <INSTALL_DIR>/fw_tutorials/queue_pt2/fw_walltime.yaml .
Look inside fw_walltime.yaml. You will see a _queueadapter key in the spec that specifies a walltime of 10 minutes. Anything in the _queueadapter key will override the corresponding parameter in my_qadapter.yaml when the Queue Launcher is run in reservation mode. So now, the Firework itself is determining key properties of the queue submission.
Let’s add and run this Firework:
lpad reset
lpad add fw_walltime.yaml
qlaunch -r singleshot
You might check the walltime that your job was submitted with using your queue manager’s built-in commands (e.g., qstat or mstat). You can also see the queue submission script by looking inside the file FW_submit.script. Inside, you’ll see the job was submitted with the walltime specified by your Firework, not the default walltime from my_qadapter.yaml.
Your job should complete successfully as before. You could also try to override other queue parameters such as the number of cores for running the job or the account which is charged for running the job. In this way, your queue submission can be tailored on a per-job basis!
Limitations: dealing with failure
One limitation of reserving FireWorks is that the Firework’s fate is tied to that of the queue submission. If the place in the queue is deleted, that Firework is stuck in limbo unless you reset its state from RESERVED back to READY. Let’s try to simulate this:
Clean your working directory of everything but four files: fw_test.yaml, my_qadapter.yaml, my_fworker.yaml, and my_launchpad.yaml.
Let’s add and run this Firework. Before the job starts running, delete it from the queue (if you’re too slow, repeat this entire step):
lpad reset
lpad add fw_test.yaml
qlaunch -r singleshot
qdel <JOB_ID>
Note
The job id should have been printed by the Queue Launcher, or you can check your queue manager. The qdel command might need to be modified, depending on the type of queue manager you use.
Now we have no jobs in the queue. But our Firework still shows up as RESERVED:
lpad get_fws -i 1 -d more
Because our Firework is RESERVED, we cannot run it:
qlaunch -r singleshot
tells us that “No jobs exist in the LaunchPad for submission to queue!” FireWorks thinks that our old queue submission (the one that we deleted) is going to run this Firework and is not letting us submit another queue script for the same job.
The way to fix this is to find all reservations that have been stuck in a queue for a long time, and then cancel (“qdel”) those reservations. The following command unreserves all FireWorks that have been stuck in a queue for 1 second or more (basically all FireWorks):
lpad detect_unreserved --time 1 --rerun
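The --time cutoff amounts to an age filter on reservations. The sketch below applies such a filter to made-up data; the function stale_reservations, the record layout, and the timestamps are hypothetical and say nothing about how lpad detect_unreserved stores or queries reservations internally.

```python
from datetime import datetime, timedelta


def stale_reservations(reservations, now, min_age_seconds):
    """Return the fw_ids whose reservation is at least `min_age_seconds` old."""
    cutoff = now - timedelta(seconds=min_age_seconds)
    return [fw_id for fw_id, reserved_at in reservations if reserved_at <= cutoff]


now = datetime(2024, 1, 15, 12, 0, 0)  # fixed "current time" for the example
reservations = [
    (1, now - timedelta(minutes=5)),  # reserved 5 minutes ago
    (2, now - timedelta(days=20)),    # reserved 20 days ago
]

# A 1-second cutoff catches essentially every reservation
everything = stale_reservations(reservations, now, min_age_seconds=1)

# A two-week cutoff (14 * 24 * 3600 = 1209600 seconds) catches only old ones
two_weeks = stale_reservations(reservations, now, min_age_seconds=14 * 24 * 3600)
```

A short cutoff is convenient for this walkthrough, but it also sweeps up reservations whose jobs are simply waiting their turn in the queue.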
Note
In production, you will want to increase the --time parameter considerably. The default value is 2 weeks (--time 1209600).
Now the Firework should be in the READY state:
lpad get_fws -i 1 -d more
And we can run it again:
qlaunch -r singleshot
Note
If you un-reserve a Firework that is still in a queue and hasn’t crashed, the consequences are not so bad. FireWorks might submit a second job to the queue that reserves this same Firework. The first queue script to run will run the Firework properly. The second job to run will not find a Firework to run and simply exit.
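The outcome described in this note can be pictured as a claim step that only one job can win. This is a simplified sketch rather than FireWorks internals: the claim function and the states dictionary are hypothetical.

```python
def claim(states, fw_id):
    """Move a Firework to RUNNING if it has not been claimed yet; report success."""
    if states.get(fw_id) in ("READY", "RESERVED"):
        states[fw_id] = "RUNNING"
        return True
    return False  # already RUNNING or COMPLETED, so this job has nothing to run


# FW 1 was un-reserved while its first job was still queued, then reserved again
states = {1: "RESERVED"}

first_job = claim(states, 1)   # the first queue script to start runs the Firework
second_job = claim(states, 1)  # the second finds it already taken and exits
```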
Conclusion
As we demonstrated, reserving jobs in the queue has several advantages, but also adds the complication that queue failure can hold up a Firework until you run a command to free up broken reservations. It is up to you which mode you prefer for your application. However, we suggest that you use only one of the two methods throughout your application. In particular, do not use the Simple Queue Launcher if you are defining the _queueadapter parameter in your spec.
If you are using the QueueLauncher in reservation mode, we suggest that you look at the tutorial on maintaining your FireWorks database (future). This will show you how to automatically clear out bad reservations periodically without needing human intervention.