

Checkpointing Programs

Benefits of Using condor_compile

You need to recompile your program with condor_compile in order to run jobs in the “standard” universe, which supports special Condor features like checkpointing and I/O redirection.

“Checkpointing” means that if Condor has to stop your job for any reason and move it to another computer, for example because a person logged on to the computer it was running on, Condor can save the state of your program and use that state to restart your program on another computer from where it left off. So, if you have a program that takes 8 hours to run to completion but Condor has to kick it off of a computer that had been running it for 6 hours, Condor will start your program on another computer at the 6-hour mark instead of from the beginning.
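The save-and-resume idea can be illustrated with a toy shell function. This is only a sketch of the concept and not how Condor works internally: Condor saves the state of the whole running program for you, whereas this example saves a single counter by hand. The function name count_with_checkpoint and its three arguments are made up for this illustration.

```shell
# Toy, application-level checkpointing. NOT Condor's mechanism: it only
# illustrates "save progress, then resume from where you left off".
#
# count_with_checkpoint STATE_FILE TOTAL BUDGET   (hypothetical helper)
#   Counts up to TOTAL steps, doing at most BUDGET steps per invocation,
#   and records the current step in STATE_FILE after every step.
count_with_checkpoint() {
    state_file=$1
    total=$2
    budget=$3

    # Resume from the saved step if a checkpoint exists, else start at 0.
    if [ -f "$state_file" ]; then
        step=$(cat "$state_file")
    else
        step=0
    fi

    done_now=0
    while [ "$step" -lt "$total" ] && [ "$done_now" -lt "$budget" ]; do
        step=$((step + 1))              # one unit of "work"
        done_now=$((done_now + 1))
        echo "$step" > "$state_file"    # save the checkpoint
    done

    # Report how far the job has progressed overall.
    echo "$step"
}
```

Calling count_with_checkpoint state.txt 8 6 twice mirrors the 8-hour job above: the first call stops after 6 steps, and the second call picks up at step 6 and finishes the remaining 2 instead of starting over.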

“I/O redirection” means that any data your program needs to run will be fetched from the computer you submitted the program on. So, if you submit a program from your home directory on your computer that needs a file called input.txt in your home directory, the program will open the file located on your computer instead of trying to find it on the execute machine that it is actually running on.

How to Use condor_compile

Luckily, Condor makes it easy to recompile your code to incorporate these features. All you have to do is run condor_compile with the command that you normally use to compile your program. For example, if you normally run

gcc -o HellowOrld HellowOrld.c

to compile your HellowOrld program, simply stick condor_compile in front of this command to compile your program to include the extra Condor features:

condor_compile gcc -o HellowOrld HellowOrld.c

How to Do Standalone Checkpointing

When you compile your program with condor_compile, you can still run it directly from the command line like you normally would. In fact, you can even use checkpointing in your program without having to submit your job to Condor. The sections below describe the standalone checkpointing features that become available after you recompile your program with condor_compile.

Enable Checkpointing

When you run your program from the command line, you need to put a command in front of it so that it can checkpoint properly. The purpose of this command is to disable memory address randomization, which is enabled by default. If you normally run your program like so:

./HellowOrld arg1 arg2 ...

simply run it with setarch x86_64 -R -L in front of it, like so:

setarch x86_64 -R -L ./HellowOrld arg1 arg2 ...

Your program will run normally on your computer. To kill it, press Ctrl-C like you normally do. This sends the SIGINT signal to your program, which by default causes a program to die immediately.
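If you run condor_compiled programs this way often, the prefix can be wrapped in a small shell function. This is just a convenience sketch: the function name run_with_aslr_off and the CKPT_ARCH variable are made up here, and the architecture defaults to the x86_64 used above. In setarch, -R is the short form of --addr-no-randomize and -L is the short form of --addr-compat-layout.

```shell
# run_with_aslr_off COMMAND [ARGS...]   (hypothetical helper name)
# Runs COMMAND with memory address randomization disabled, using the same
# setarch flags as the command shown above.
run_with_aslr_off() {
    # CKPT_ARCH is an assumed override variable, not a Condor or setarch
    # setting; it defaults to the architecture used in this document.
    setarch "${CKPT_ARCH:-x86_64}" -R -L "$@"
}
```

With this defined in your shell, run_with_aslr_off ./HellowOrld arg1 arg2 runs the same command as the setarch line above.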

Save a Checkpoint and Exit

To make your program save a checkpoint and exit, first run your condor_compiled program with checkpointing enabled (see above). With your program running, simply press Ctrl-Z. This sends the SIGTSTP signal to the currently running program. Usually, the SIGTSTP signal tells a process to freeze but not exit, but Condor has configured your program to interpret this signal to mean that it should create a checkpoint and then exit. When this happens, your program will create a checkpoint file with the same name as the program but ending in .ckpt. So, if you checkpoint your HellowOrld program, the checkpoint file will be saved as HellowOrld.ckpt.
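Ctrl-Z only reaches a program running in the foreground of your terminal. The signal it delivers, SIGTSTP, can also be sent from another terminal with kill -TSTP followed by the program's process ID. The sketch below wraps that in a helper that then waits for the checkpoint file to show up. The function name request_checkpoint, its arguments, and the 10-second default timeout are assumptions made for this example, not part of Condor.

```shell
# request_checkpoint PID CKPT_FILE [TIMEOUT_SECONDS]   (hypothetical helper)
# Sends SIGTSTP to PID (the same signal Ctrl-Z delivers to a foreground
# program), then checks once per second whether CKPT_FILE exists.
# Returns 0 if the file appeared within the timeout, 1 otherwise.
request_checkpoint() {
    pid=$1
    ckpt_file=$2
    timeout=${3:-10}    # assumed default, in seconds

    # Give up right away if the signal could not be delivered.
    kill -TSTP "$pid" || return 1

    waited=0
    while [ ! -f "$ckpt_file" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, request_checkpoint 12345 HellowOrld.ckpt, where 12345 stands in for the process ID of your running program. Note that the helper only checks that the file exists, so remove any HellowOrld.ckpt left over from an earlier run first, and wait for the program itself to exit before relying on the file.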

condor/submit/checkpointing.1312555646.txt.gz · Last modified: 2011/08/05 14:47 by garrettheath4
CC Attribution-Noncommercial-Share Alike 4.0 International