Troubleshooting Job Submissions

Job Stays on Queue

If your job stays on the queue for a long time even though you think there are execute machines that it should be able to run on, your program may be starting on a machine and then hitting an error that kills it immediately. When this happens, the job is put back on the queue so quickly that you probably can't tell it ever left.

You can find out what is going on by checking a chain of logs to figure out at what step your job failed. First, check the Negotiator log to see if your job was ever successfully matched. Run condor_q and note your job's ID. Then, run

condor_fetchlog `hostname` NEGOTIATOR | fgrep JOB_ID

where JOB_ID is the ID of your job as listed in the Condor queue, such as “25.0”. If the Negotiator found a suitable computer to run your job on, you should see one or more lines indicating the match, such as:

08/02/11 09:56:49     Request 00025.00000:
08/02/11 09:56:49       Matched 25.0 <> preempting none <>
08/02/11 10:02:49     Request 00025.00000:
08/02/11 10:02:49       Matched 25.0 <> preempting none <>
08/02/11 10:07:49     Request 00025.00000:
08/02/11 10:07:49       Matched 25.0 <> preempting none <>
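Since the sample output shows the same job matched several times, it can be handy to count the matches rather than read them off by eye. The following is a minimal sketch, assuming the Negotiator log uses the "Matched JOB_ID" wording seen in the sample output; count_matches is a hypothetical helper name, not a Condor command.

```shell
#!/bin/sh
# count_matches is a hypothetical helper, not part of Condor.
# It reads Negotiator log text on stdin and prints how many
# "Matched JOB_ID" lines it contains for the given job ID.
count_matches() {
    job_id="$1"
    # -F treats the ID as a fixed string, so "." is not a wildcard.
    # The trailing space keeps "25.1" from also matching "25.10".
    # grep -c exits non-zero when the count is 0, hence "|| true".
    grep -F -c "Matched ${job_id} " || true
}

# Assumed usage, with 25.0 standing in for your job ID:
#   condor_fetchlog `hostname` NEGOTIATOR | count_matches 25.0
```

Fed the sample output above, this prints 3.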

If more than one match is listed for the same job (as in the sample output above), that is a good indication that the job is running but failing immediately. To dig further into the cause of such an error, check the logs of one of the specific slots and machines that it was matched with:

condor_fetchlog RUN_MACHINE STARTER.slotX

where RUN_MACHINE is the machine that your job was matched with and slotX is the slot on that machine where it may have run, such as

condor_fetchlog -master STARTER.slot1

If you tried to run your program recently, that run should be listed at the bottom of the log. To find where the log for a specific run starts, look for a section surrounded by asterisks and containing the line

** condor_starter (CONDOR_STARTER) STARTING UP

and read the lines below that section to find out if any errors occurred while Condor tried to execute your job on the execute machine.
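Scrolling to the most recent run can be automated. The following is a minimal sketch, assuming each run in the Starter log begins with the "STARTING UP" line quoted above; last_starter_run is a hypothetical helper name, not a Condor command.

```shell
#!/bin/sh
# last_starter_run is a hypothetical helper, not part of Condor.
# It reads Starter log text on stdin and prints only the most
# recent run: everything from the last "STARTING UP" line onward.
last_starter_run() {
    awk '
        # index() does a fixed-string search, so the parentheses
        # in the marker need no escaping.
        index($0, "condor_starter (CONDOR_STARTER) STARTING UP") > 0 {
            n = 0            # a new run begins: drop earlier lines
        }
        { buf[n++] = $0 }    # remember every line of the current run
        END {
            for (i = 0; i < n; i++) print buf[i]
        }
    '
}

# Assumed usage, with RUN_MACHINE and slotX as described above:
#   condor_fetchlog RUN_MACHINE STARTER.slotX | last_starter_run
```

If the log contains no "STARTING UP" line at all, the sketch prints the input unchanged.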

condor/submit/troubleshoot.txt · Last modified: 2011/08/02 11:49 by garrettheath4
CC Attribution-Noncommercial-Share Alike 4.0 International