======Checkpointing Programs======
=====Benefits of Using condor_compile=====
To run jobs in the "standard" universe, which supports special Condor features like //checkpointing// and //I/O redirection//, you first need to recompile your program with ''condor_compile''.
"Checkpointing" means that if Condor has to stop your job and move it to another computer for any reason (for example, because a person logged on to the computer), Condor can save the state of your program and use that state to start your program on another computer from where it left off. So, if you have a program that takes 8 hours to run to completion but Condor has to kick it off of a computer that was running it for 6 hours, Condor will start your program on another computer at the 6-hour mark.
"I/O redirection" means that any data your program needs to run will be fetched from the computer you submitted the program on. So, if you submit a program from your home directory on your computer that needs a file called ''input.txt'' in your home directory, the program will open the file located on your computer instead of trying to find it on the execute machine that it is actually running on.
=====How to Use condor_compile=====
Luckily, Condor makes it easy to recompile your code to incorporate these features. All you have to do is run ''condor_compile'' with the command that you normally use to compile your program. For example, if you normally run
  gcc -o HellowOrld HellowOrld.c
to compile your ''HellowOrld'' program, simply stick ''condor_compile'' in front of this command to compile your program to include the extra Condor features:
  condor_compile gcc -o HellowOrld HellowOrld.c
**Note** that this will overwrite any existing non-Condor compilation called "''HellowOrld''". It may be a good idea to compile your program normally first and then compile it again with ''condor_compile'' but with a different name, such as ''condor_compile gcc -o HellowOrldCondor HellowOrld.c''. Although a program compiled with ''condor_compile'' should still have the same non-Condor features and functionality as one compiled normally, it may help with debugging to still have a copy of your program that was compiled normally.
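The advice above, keeping both a normal build and a Condor build, could be scripted roughly as follows. This is only a sketch: the function name ''build_both'' and the ''Condor'' suffix on the second binary are illustrative choices, and it assumes a single C source file compiled with ''gcc''.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): build a normal binary and a
# condor_compile'd binary from the same C source, under different names.
build_both() {
    src="$1"              # e.g. HellowOrld.c
    base="${src%.c}"      # strip the .c suffix -> HellowOrld

    # Normal build, kept around for debugging.
    gcc -o "$base" "$src" || return 1

    # Condor build under a different name so it does not overwrite the first.
    if command -v condor_compile >/dev/null 2>&1; then
        condor_compile gcc -o "${base}Condor" "$src"
    else
        echo "condor_compile not found; skipped the Condor build" >&2
        return 2
    fi
}
```

After sourcing this, ''build_both HellowOrld.c'' would leave ''HellowOrld'' and ''HellowOrldCondor'' side by side.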
=====Standalone Checkpointing=====
When you compile your program with ''condor_compile'', you can still run it directly from the command line like you normally would. In fact, you can even use checkpointing in your program without having to submit your job to Condor. After you recompile your program with ''condor_compile'', you can save checkpoints on demand and restart from them, as described in the sections below.
====Enable Checkpointing====
When you run your program from the command line, you need to put a command in front of it to make it able to checkpoint properly. The purpose of this command is to disable memory address randomization, which is enabled by default (the ''-R'' option turns off address randomization and ''-L'' selects the legacy address space layout). If you normally run your program like so:
  ./HellowOrld arg1 arg2 ...
simply run it with ''setarch x86_64 -R -L'' in front of it, like so:
  setarch x86_64 -R -L ./HellowOrld arg1 arg2 ...
:!: **Note**: This page assumes you have a 64-bit processor. If you have an older 32-bit processor architecture, use ''setarch i386'' instead of ''setarch x86_64'' when running your ''condor_compile''d program on your own 32-bit computer. To find out what kind of processor you have, run ''uname -p'' and use whatever is printed as the argument for ''setarch'' commands. On some systems ''uname -p'' only prints ''unknown''; in that case ''uname -m'' usually reports the machine type instead.
Your program will run normally on your computer. To kill it, press ''Ctrl-C'' like you normally do. This sends the ''SIGINT'' signal to your program, which by default causes a program to terminate immediately.
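To avoid typing the ''setarch'' prefix every time, the command above could be wrapped in a small shell function. This is a sketch rather than anything Condor provides: the name ''ckpt_run'' is made up, and it assumes that ''uname -m'' prints an architecture name that ''setarch'' accepts on your system (it is used here in place of ''uname -p'').

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Condor): run a condor_compile'd program
# with address randomization disabled so that checkpointing can work.
ckpt_run() {
    # Assumption: uname -m prints a name setarch understands (e.g. x86_64).
    arch="$(uname -m)"

    # -R disables address randomization, -L uses the legacy address layout.
    setarch "$arch" -R -L "$@"
}
```

For example, ''ckpt_run ./HellowOrld arg1 arg2'' would run the same command as shown above on a 64-bit machine.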
====Save a Checkpoint and Exit====
To make your program save a checkpoint and exit, first run your ''condor_compile''d program with checkpointing enabled (see above). With your program running, simply press ''Ctrl-Z''. This sends the ''SIGTSTP'' signal to the currently running program. Usually, the ''SIGTSTP'' signal tells a process to suspend but not exit, but Condor has configured your program to interpret this signal to mean that it should create a checkpoint and then exit. When this happens, your program will create a checkpoint file with the same name as the program but ending in ''.ckpt''. So, if you checkpoint your ''HellowOrld'' program, the checkpoint file will be saved as ''HellowOrld.ckpt''.
====Save a Checkpoint without Exiting====
To make your program save a checkpoint but not exit, run your ''condor_compile''d program with checkpointing enabled. With your program running in one terminal window, open up a separate terminal window and run the ''killall'' command with the ''-s USR2'' argument to send your program the ''SIGUSR2'' signal, like so:
  killall -s USR2 HellowOrld
Your program will receive the signal and create a checkpoint but will continue running. That way, if the program is killed later and not given the chance to make a new checkpoint, you can start the program from where it left off at the checkpoint instead of completely starting over.
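If you want checkpoints taken at regular intervals rather than by hand, the ''killall'' command above could be placed in a loop. The following is a minimal sketch under some assumptions: the function name ''periodic_ckpt'' and the default interval of 300 seconds are arbitrary choices, and ''pgrep'' is assumed to be installed to detect whether the program is still running.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): ask a running condor_compile'd
# program to checkpoint every few seconds until it exits.
periodic_ckpt() {
    name="$1"               # process name, e.g. HellowOrld
    interval="${2:-300}"    # seconds between checkpoints (arbitrary default)

    # Keep going for as long as a process with exactly this name exists.
    while pgrep -x "$name" >/dev/null 2>&1; do
        # SIGUSR2: save a checkpoint and continue running.
        killall -s USR2 "$name"
        sleep "$interval"
    done
}
```

Running ''periodic_ckpt HellowOrld 600'' in a second terminal would request a checkpoint roughly every ten minutes and stop once the program is gone.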
====Start from Checkpoint====
If you have checkpointed a program and so have the ''*.ckpt'' file, you can resume the program from the point that it was checkpointed. If, for example, your program is called ''HellowOrld'' and your checkpoint file is called ''HellowOrld.ckpt'', run this to load the program from the checkpoint file:
  setarch x86_64 -R -L ./HellowOrld -_condor_restart HellowOrld.ckpt
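Starting fresh and resuming from a checkpoint can be combined into one helper that checks whether a checkpoint file exists. This is a sketch only: the function name ''start_or_resume'' is made up, it assumes the program and its ''.ckpt'' file are in the current directory, and it assumes ''uname -m'' prints an architecture name that ''setarch'' accepts.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): resume from PROGRAM.ckpt if it
# exists in the current directory, otherwise start the program normally.
start_or_resume() {
    prog="$1"               # program name, e.g. HellowOrld
    shift                   # remaining arguments go to a fresh run
    ckpt="${prog}.ckpt"
    arch="$(uname -m)"      # assumption: accepted by setarch (e.g. x86_64)

    if [ -f "$ckpt" ]; then
        # A checkpoint exists: restart from where the program left off.
        setarch "$arch" -R -L "./$prog" -_condor_restart "$ckpt"
    else
        # No checkpoint yet: run from the beginning with the given arguments.
        setarch "$arch" -R -L "./$prog" "$@"
    fi
}
```

With this, ''start_or_resume HellowOrld arg1 arg2'' starts the program the first time and resumes it on later invocations once ''HellowOrld.ckpt'' has been written. Delete the checkpoint file if you want to start over.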