Checkpointing Programs

Benefits of Using condor_compile

You must compile your program with condor_compile in order to run jobs in the “standard” universe, which supports special Condor features like checkpointing and I/O redirection.

“Checkpointing” means that if Condor has to stop your job and move it to another computer (for example, because a person logged on to the computer), Condor can save the state of your program and use that state to restart your program on another computer from where it left off. So, if you have a program that takes 8 hours to run to completion but Condor has to kick it off of a computer that has been running it for 6 hours, Condor will start your program on another computer at the 6-hour mark.

“I/O redirection” means that any data your program needs to run will be fetched from the computer you submitted the program on. So, if you submit a program from your home directory on your computer that needs a file called input.txt in your home directory, the program will open the file located on your computer instead of trying to find it on the execute machine that it is actually running on.
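As a rough sketch of how a job might be pointed at the standard universe, the shell snippet below writes a minimal submit description file. The file name HellowOrld.sub, the executable name HellowOrldCondor, and the output, error, and log file names are all assumptions made for illustration; adjust them for your own job.

```shell
# Write a minimal submit description file (all file names are hypothetical).
cat > HellowOrld.sub <<'EOF'
# Standard universe: enables checkpointing and I/O redirection
universe   = standard
# Must be a binary built with condor_compile
executable = HellowOrldCondor
output     = HellowOrld.out
error      = HellowOrld.err
log        = HellowOrld.log
queue
EOF

# Submit it from the directory that holds input.txt:
#   condor_submit HellowOrld.sub
```

Because of I/O redirection, input.txt does not need to be copied to the execute machine in this sketch.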

How to Use condor_compile

Luckily, Condor makes it easy to recompile your code to incorporate these features. All you have to do is run condor_compile with the command that you normally use to compile your program. For example, if you normally run

gcc -o HellowOrld HellowOrld.c

to compile your HellowOrld program, simply stick condor_compile in front of this command to compile your program to include the extra Condor features:

condor_compile gcc -o HellowOrld HellowOrld.c

Note that this will overwrite any existing non-Condor compilation called “HellowOrld”. It may be a good idea to compile your program normally first and then compile it again with condor_compile but with a different name, such as condor_compile gcc -o HellowOrldCondor HellowOrld.c. Although a program compiled with condor_compile should still have the same non-Condor features and functionality as one compiled normally, it may help with debugging to still have a copy of your program that was compiled normally.
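The two-build approach could be wrapped in a small shell function along these lines. This is only a sketch: the function name build_both is made up for this example, and it assumes a single C source file compiled with gcc.

```shell
# Build a plain binary and a Condor-linked binary from one C source file.
# Usage: build_both HellowOrld.c
build_both() {
    src=$1
    name=${src%.c}                       # HellowOrld.c -> HellowOrld
    gcc -o "$name" "$src" || return 1    # normal build, kept for debugging
    condor_compile gcc -o "${name}Condor" "$src"
}
```

Running build_both HellowOrld.c would leave HellowOrld (normal) and HellowOrldCondor (built with condor_compile) side by side.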

Standalone Checkpointing

When you compile your program with condor_compile, you can still run it directly from the command line like you normally would. In fact, you can even use checkpointing in your program without having to submit your job to Condor. After you recompile your program with condor_compile, you can make it save a checkpoint on demand (with or without exiting) and later resume it from that checkpoint.

Enable Checkpointing

When you run your program from the command line, you need to put a command in front of it to make it able to checkpoint properly. The purpose of this command is to disable memory address randomization, which is enabled by default. If you normally run your program like so:

./HellowOrld arg1 arg2 ...

simply run it with setarch x86_64 -R -L in front of it, like so:

setarch x86_64 -R -L ./HellowOrld arg1 arg2 ...

Note: This page assumes you have a 64-bit processor. If you have an older 32-bit processor, use setarch i386 instead of setarch x86_64 when running your condor_compiled program on your own 32-bit computer. To find out what kind of processor you have, run uname -p and use whatever is printed as the argument for setarch commands. (On some Linux systems uname -p prints “unknown”; in that case uname -m usually reports the architecture.)
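To avoid typing the setarch prefix each time, a small wrapper function is one option. The sketch below is an assumption-laden example, not part of Condor: the name run_ckpt is invented here, and it uses uname -m rather than uname -p to pick the architecture, on the assumption that setarch accepts whatever uname -m prints on your system.

```shell
# Run a condor_compile'd program with address randomization disabled.
# Usage: run_ckpt ./HellowOrld arg1 arg2
run_ckpt() {
    arch=$(uname -m)              # assumption: prints e.g. x86_64
    setarch "$arch" -R -L "$@"    # same flags as the command shown above
}
```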

Your program will run normally on your computer. To kill it, press Ctrl-C like you normally do. This sends the SIGINT signal to your program, which by default causes a program to terminate immediately.

Save a Checkpoint and Exit

To make your program save a checkpoint and exit, first run your condor_compiled program with checkpointing enabled (see above). With your program running, simply press Ctrl-Z. This sends the SIGTSTP signal to the currently running program. Usually, the SIGTSTP signal tells a process to suspend but not exit, but Condor has configured your program to interpret this signal to mean that it should create a checkpoint and then exit. When this happens, your program will create a checkpoint file with the same name as the program but ending in .ckpt. So, if you checkpoint your HellowOrld program, the checkpoint file will be saved as HellowOrld.ckpt.
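Ctrl-Z only reaches the program in the foreground of your terminal, and the signal it delivers is SIGTSTP. If the program is running elsewhere, the same signal could be sent with killall, roughly as sketched below. The function name checkpoint_and_exit is hypothetical, and the sketch assumes that the .ckpt file is written to the directory you call it from.

```shell
# Ask a running condor_compile'd program to checkpoint and exit,
# then report whether the checkpoint file appeared.
# Usage: checkpoint_and_exit HellowOrld
checkpoint_and_exit() {
    prog=$1
    killall -s TSTP "$prog" || return 1    # same signal as Ctrl-Z
    # Wait until the program has finished writing and exited.
    while pgrep -x "$prog" >/dev/null; do
        sleep 1
    done
    if [ -f "$prog.ckpt" ]; then           # assumption: file is in this directory
        echo "saved $prog.ckpt"
    else
        echo "no checkpoint file found" >&2
        return 1
    fi
}
```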

Save a Checkpoint without Exiting

To make your program save a checkpoint but not exit, run your condor_compiled program with checkpointing enabled. With your program running in one terminal window, open up a separate terminal window and run the killall command with the -s USR2 argument to send your program the SIGUSR2 signal, like so:

killall -s USR2 HellowOrld

Your program will receive the signal and create a checkpoint but will continue running. That way, if the program is killed later and not given the chance to make a new checkpoint, you can start the program from where it left off at the checkpoint instead of completely starting over.
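Since a single SIGUSR2 produces a single checkpoint, one way to get periodic checkpoints is to send the signal in a loop from a second terminal. The following is a minimal sketch under that assumption; the function name checkpoint_every and the 600-second interval in the usage line are illustrative choices, not Condor settings.

```shell
# Send SIGUSR2 to a running program every INTERVAL seconds until it exits.
# Usage: checkpoint_every HellowOrld 600
checkpoint_every() {
    prog=$1
    interval=$2
    while pgrep -x "$prog" >/dev/null; do
        # Stop quietly if the program exited between the check and the signal.
        killall -s USR2 "$prog" 2>/dev/null || break
        sleep "$interval"
    done
}
```

A longer interval means less time spent writing checkpoints but more work lost if the program is killed between them.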

Start from Checkpoint

If you have checkpointed a program and so have the *.ckpt file, you can resume the program from the point that it was checkpointed. If, for example, your program is called HellowOrld and your checkpoint file is called HellowOrld.ckpt, run this to load the program from the checkpoint file:

setarch x86_64 -R -L ./HellowOrld -_condor_restart HellowOrld.ckpt
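
Starting fresh and restarting from a checkpoint can be combined into one helper that picks the right command automatically. This is a sketch only: run_or_resume is a made-up name, it hard-codes x86_64 as in the examples above, and it assumes the checkpoint file sits in the current directory.

```shell
# Resume from PROG.ckpt if it exists, otherwise start PROG from scratch.
# Usage: run_or_resume ./HellowOrld arg1 arg2
run_or_resume() {
    prog=$1
    shift
    ckpt="$(basename "$prog").ckpt"    # assumption: checkpoint is in this directory
    if [ -f "$ckpt" ]; then
        setarch x86_64 -R -L "$prog" -_condor_restart "$ckpt"
    else
        setarch x86_64 -R -L "$prog" "$@"
    fi
}
```
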
condor/submit/checkpointing.txt · Last modified: 2011/08/05 16:02 by garrettheath4
CC Attribution-Noncommercial-Share Alike 4.0 International