If your job appears to stay on the queue for a long time and you think there are execution machines that it should be able to run on, your program may actually be starting on a machine and then hitting an error that causes it to be killed immediately. When this happens, the job is put back on the queue so quickly that you probably can't tell that it ever left.
You can find out what is going on by checking a chain of logs to figure out at what step your job failed. First, check the Negotiator log to see if your job was ever successfully matched. Run condor_q and note your job's ID. Then, run
condor_fetchlog `hostname` NEGOTIATOR | fgrep JOB_ID
where JOB_ID is the ID of your job as listed in the Condor queue, such as “26.0”. If the Negotiator found a suitable computer to run your job on, you should see one or more lines indicating the match, such as:
08/02/11 09:56:49 Request 00025.00000:
08/02/11 09:56:49 Matched 25.0 koller@cs.wlu.edu <137.113.118.66:9692> preempting none <137.113.118.65:9697> slot1@fred.cs.wlu.edu
08/02/11 10:02:49 Request 00025.00000:
08/02/11 10:02:49 Matched 25.0 koller@cs.wlu.edu <137.113.118.66:9692> preempting none <137.113.118.65:9697> slot1@fred.cs.wlu.edu
08/02/11 10:07:49 Request 00025.00000:
08/02/11 10:07:49 Matched 25.0 koller@cs.wlu.edu <137.113.118.66:9692> preempting none <137.113.118.65:9697> slot1@fred.cs.wlu.edu
If more than one match is listed for the same job (as in this example), that is a good indication that the job is starting but failing immediately. To dig into the cause of such an error, check the starter log for one of the machines and slots that the job was matched with:
condor_fetchlog RUN_MACHINE STARTER.slotX
where RUN_MACHINE is the machine that your job was matched with and slotX is the slot on that machine that it may have run in, such as
condor_fetchlog -master fred.cs.wlu.edu STARTER.slot1
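If the Negotiator log is long, it may be easier to count the matches and list the slots with a couple of small shell functions than to read the output by eye. The following is only a sketch: the function names count_matches and matched_slots are made up for illustration, and it assumes the log lines look like the example above (the word "Matched", the job ID, and the slot name as the last field on the line).

```shell
# Hypothetical helpers; both read Negotiator log text on stdin.
# Assumes "Matched <job id> ..." lines shaped like the example output.

# Print how many times the given job ID (e.g. 25.0) was matched.
count_matches() {
    # -F treats the pattern as a fixed string, so "." is not a wildcard.
    # The trailing space keeps 25.0 from also matching 25.01, 25.02, ...
    grep -cF "Matched $1 "
}

# Print the distinct slot@machine names the job was matched with.
matched_slots() {
    grep -F "Matched $1 " | awk '{ print $NF }' | sort -u
}
```

It could be used as condor_fetchlog `hostname` NEGOTIATOR | count_matches 25.0 and likewise with matched_slots. A count greater than one suggests the job is being started repeatedly, and the slot names tell you which RUN_MACHINE and slotX to try.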
If you tried to run your program recently, it should be listed at the bottom of the log. To find where the log for a specific run starts, look for a section surrounded by asterisks and containing the line
** condor_starter (CONDOR_STARTER) STARTING UP
and read the lines below that section to find out if any errors occurred while Condor tried to execute your job on the execute machine.
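To skip straight to the most recent run instead of scrolling through the whole starter log, something like the following shell function might help. It is a minimal sketch: last_starter_run is a hypothetical name, and it assumes that every run begins with the STARTING UP line shown above.

```shell
# Hypothetical helper; reads STARTER log text on stdin and prints
# everything from the last "STARTING UP" banner line to the end.
# If no banner is found, the whole input is printed unchanged.
last_starter_run() {
    awk '
        # A new run begins: discard the lines buffered so far.
        /condor_starter \(CONDOR_STARTER\) STARTING UP/ { n = 0 }
        # Buffer every line of the current run, banner included.
        { buf[n++] = $0 }
        END { for (i = 0; i < n; i++) print buf[i] }
    '
}
```

For example: condor_fetchlog -master fred.cs.wlu.edu STARTER.slot1 | last_starter_run. The output starts at the banner line itself, so the row of asterisks directly above it is not included.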