Condor TODO List
Note: This page is deprecated. The new TODO list is located on the Tasks To-Do List page.
Pending
- Configure firewalls to open port range 9600-9700 on all Condor computers.
- Install Monit to monitor system services and processes to ensure that they are running at all times. Configure it so that if Condor crashes, Monit will automagically restart it.
- Update links to point to newer Condor v. 7.6.4 release directory instead of old v. 7.6.1.
- Set up john to be an email server so that condor-admin@condor.cs.wlu.edu can receive mail (and just forward it to me). Configure mail rerouting so that outgoing mail will be rerouted to the sendmail service on terras. That way, the sendmail service doesn't have to run all the time.
- Set up basic email server for condor@condor.cs.wlu.edu to send Condor mail and condor-admin@condor.cs.wlu.edu to receive mail (and forward to me).
- Change public email address to be condor-admin@condor.cs.wlu.edu in mail map file, wiki homepage, Condor configuration file, and installation script.
- Get parallel universe to work
- Install MPI on all of the nodes?
- Get Java universe to work
- For Dr. Lambert's CS 320 Parallel Computing class
- Enable “no root squashing” on NAS to prevent local root users from modifying files.
- Make Condor's slot allocation “nicer”.
- Fix the RANK expression so that jobs are allocated in a breadth-first manner instead of a depth-first manner, distributing the work evenly between machines. Ex: 4 submitted jobs on 4 machines with 2 slots each should run with one job on each of the 4 machines instead of all the jobs running on the first two machines and none on the other two. Use the TotalLoadAvg variable in the RANK expression to take the total CPU usage into account.
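The difference between the two placement orders can be sketched in a few lines of Python. This is only a toy model of the example above (4 jobs, 4 machines, 2 slots each), not Condor's matchmaking; the function name `place_jobs` and the machine names are made up for illustration.

```python
def place_jobs(num_jobs, machines, slots_per_machine, breadth_first):
    """Return a dict mapping each machine name to the number of jobs placed on it."""
    load = {name: 0 for name in machines}
    for _ in range(num_jobs):
        # Machines that still have at least one free slot.
        free = [m for m in machines if load[m] < slots_per_machine]
        if not free:
            break  # pool is full; remaining jobs would stay queued
        if breadth_first:
            # Prefer the least-loaded machine (ties go to the earlier machine).
            target = min(free, key=lambda m: load[m])
        else:
            # Depth-first: fill the first machine that still has a free slot.
            target = free[0]
        load[target] += 1
    return load


machines = ["m1", "m2", "m3", "m4"]  # hypothetical machine names
depth = place_jobs(4, machines, 2, breadth_first=False)
breadth = place_jobs(4, machines, 2, breadth_first=True)
print(depth)    # {'m1': 2, 'm2': 2, 'm3': 0, 'm4': 0}
print(breadth)  # {'m1': 1, 'm2': 1, 'm3': 1, 'm4': 1}
```

In the real pool the "least-loaded" preference would have to be expressed through the rank expressions (for example with a load-average attribute), which this sketch does not attempt.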
- Implement DAGMan to allow Lee to introduce ordering within the job submissions.
- Create custom condor.py script to help users easily submit their jobs
- Remove “local” argument in the Shell object's __init__ method. (See TODO there.)
- Use kwargs or the command module to make a queueMany(argsLst, outLst, inLst, logLst, …) method in the Condor module with a variable number of arguments.
- Start an interactive session after Condor.submit() is run so that the user can (manually or automatically) poll for the current status of their jobs, remove the jobs, etc.
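One possible shape for the kwargs idea is sketched below. It is a guess at the intent, not the Condor module's actual API: the name `queue_many`, the keyword names (`output`, `input`, `log`) and the dictionary-per-job result are all assumptions made for illustration.

```python
def queue_many(args_list, **per_job_files):
    """Build one job description (a dict) per entry in args_list.

    per_job_files holds optional keyword lists such as output=[...],
    input=[...] or log=[...]; each must have one entry per job.
    """
    for key, values in per_job_files.items():
        if len(values) != len(args_list):
            raise ValueError(
                "%s has %d entries but there are %d jobs"
                % (key, len(values), len(args_list))
            )
    jobs = []
    for index, args in enumerate(args_list):
        job = {"arguments": args}
        # Attach only the per-job files the caller actually supplied.
        for key, values in per_job_files.items():
            job[key] = values[index]
        jobs.append(job)
    return jobs


jobs = queue_many(["10", "20"], output=["a.out", "b.out"], log=["a.log", "b.log"])
for job in jobs:
    print(job)
```

Because the lists arrive as keyword arguments, a caller can leave out any of them without padding the call with empty placeholders.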
- List and answer questions asked in the beginning of the installation section of the Condor manual. Add short reasons why you chose this and link any install pages that offer more detail. Put this in the “Installation homepage” for the installation section (it's too empty anyway).
- Make “How to” page about how to configure the Apache web server and install CondorView onto it.
- Make “How to” pages for how to submit jobs.
- A bunch of MATLABs from a list of strings.
- Change babbage's configuration variables so that mouse and keyboard activity more readily boots a running job. Then, test the SUSPEND, VACATE, and KILL settings.
- SSH activity on a machine resets KeyboardIdle but not ConsoleIdle. Actual keyboard activity resets both. Fix this so that SSH activity resets ConsoleIdle only and keyboard activity (and mouse activity) resets KeyboardIdle only.
- Create “MATLAB” requirement on all of the machines with MATLAB (local config file) to designate which machines have MATLAB. See Section 2.5.2 “About Requirements and Rank” (pg. 21) in the Condor documentation.
- Limit the number of MATLAB jobs to a certain number (based on number of MATLAB licenses we have).
- Install python and python3 from source with condor_compile (compile as python3condor and idle3condor, for example) to be able to checkpoint any Python program (if the user specifies python3condor as the executable).
- Install Octave (free, open-source MATLAB look-alike) from source with condor_compile to be able to checkpoint basic (portable) MATLAB code.
- Install and configure MPICH (add secret passphrase to that one file?)
- Configure condor_cod “Computing on Demand”, such as to run full graphical MATLAB in parallel
- Configure MATLAB on The Stable to use Condor
- Run and configure condor_credd for secure user credentials storage (for Windows machines only?).
- Add Windows VMs to Condor workers (Install xen, since in yum? but VMware more user friendly)
News
Items listed here are completed items from the list above. Items should be listed in earliest-to-latest order (more or less) and appended with the signature of the user who completed the task. Note: Cross out unimportant items.
- Put Global Config file in /mnt/config/hosts/condor_config_global
- Add condor.cs.wlu.edu to DNS to refer to fred.cs.wlu.edu. Change some references from fred.cs.wlu.edu to condor.cs.wlu.edu? (Undone)
- Made two types of local config files: role (master/worker) and host-specific. You can specify more than one local configuration file in LOCAL_CONFIG_FILE (comma separated?). Use ~/condor/Settings Tree.mm for reference.
- Add AddHost.sh script in /mnt/config/hosts to create directories for a new host. Reserve a UID and GID from an invalid condor user on terras. (Used really big numbers instead of actually reserving UID and GID.) Make the local condor users use that dummy UID and GID. Compare /etc/passwd from all pool machines and from terras to find a suitable UID and GID for condor. All machines should use the same CONDOR_IDS for the sake of file permission management on the NAS.
- Set correct CONDOR_IDS config variable on all machines, whether it is globally or locally defined.
- Remove /etc/condor directories.
- Undefine CONDOR_CONFIG environment variable for all members of the Condor pool. Global environment variables a bad idea?
- Add a symbolic link at /var/lib/condor/condor_config to link to the global configuration file on the NAS. This location is considered to be the home folder of the condor users.
- Put /etc/init.d/condor file on the NAS? Networking should be started by the time it is run.
- Implement an authentication protocol as described in Section 3.6.3 “Authentication” in the Condor documentation (pg. 323).
- Configured user-based authorization (instead of host-based) as described in Section 3.6.7 “Authorization” of the Condor documentation (pg. 337).
- According to the Condor documentation, the “Map File” of the user based authorization allows us to map various authorization techniques to canonical usernames and hostnames.
- Fixed an error in the MasterLog of carl and fred by implementing user-based authentication instead of host-wide authentication. Also added password authentication to enhance the security of the system. The error was:
- MasterLog
PERMISSION DENIED to unauthenticated@unmapped from host 137.113.118.64 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 137.113.118.64,carl.cs.wlu.edu,carl
- Fixed “no network route” error in NegotiatorLog:
- NegotiatorLog
07/21/11 16:25:21 attempt to connect to <137.113.118.65:52652> failed: No route to host (connect errno = 113).
07/21/11 16:25:21 ERROR: SECMAN:2004:Failed to create security session to <137.113.118.65:52652> with TCP. |SECMAN:2003:TCP connection to <137.113.118.65:52652> failed.
07/21/11 16:25:21 Failed to initiate socket to send MATCH_INFO to slot1@fred.cs.wlu.edu
Opened port range on firewall on pool member machines for Condor to use.
- Undid TestingMode changes to global configuration file (look for TODOs).
- Moved KeyboardIdle condition to the RANK configuration. That way, the START condition won't depend on keyboard idle time, but a longer idle time will make jobs more likely to run on that slot.
- Disabled host-wide ALLOW_CONFIG access to john.cs.wlu.edu. I had enabled it only for testing purposes, but it was a rather large security hole to allow anyone who could log into john to change the configuration settings. Before this, the condor_config_val program didn't want to authenticate no matter what I did and so always presented itself as “unauthenticated@unmapped” to the condor_master. Now that I have changed the security settings (SEC_*_AUTHENTICATION), the condor_config_val program authenticates and properly presents itself as “koller@cs.wlu.edu”.
- Added babbage.cs.wlu.edu to the Condor pool by installing Condor on it with the wlu-cs-condor-install.sh script. Because of this, I was able to test the script for a clean install and fix some bugs that I noticed. The script didn't create the symbolic link for the shared configuration file, for example, so I fixed that. The script also needed to create the /var/run/condor directory for the boot script to store the pidfile there, so I added it to the script. — Garrett Koller 2011/07/26 12:09
- Put (or link?) all install- and configuration-related scripts in a centralized folder: /mnt/config/scripts/. — Garrett Koller 2011/07/27 15:30
- Modified the install script in /mnt/config/scripts/ to prompt for the operating system and processor architecture and then pick the appropriate release directory to install from. — Garrett Koller 2011/07/27 15:31
- Moved the AddHost.sh script to the /mnt/config/scripts/ folder. Try using the uname -p command to just make the script universal and centralized. — Garrett Koller 2011/07/27 15:30
- Made uninstall script called wlu-cs-condor-uninstall.sh to undo everything in wlu-cs-condor-install.sh — Garrett Koller 2011/07/27 15:31
- Changed the condor UID and GID to 64. Since 64 is in the system ID range, it will not be assigned to normal users. Furthermore, 64 is “well-known” to be used for the condor user and group according to the Red Hat Enterprise Linux 6 Deployment Guide. In order to change the ownership of all of the old condor user and group files, I ran the following script on all of the machines in the Condor pool:
- ChangeCondorID.sh
for file in `find / -uid 1344`
do
    sudo chown -v condor "$file"
done
for file in `find / -gid 1610`
do
    chown -v :condor "$file"
done
— Garrett Koller 2011/07/28 15:03
- Modified install script and uninstall script to automatically add (and remove, respectively) the hostname in the PoolMembers variable in the global configuration file. I did this with a few simple sed commands. — Garrett Koller 2011/07/29 14:41
- Run checkPrimes program on Condor and test checkpointing functionality.
- Force a machine (or slot) to checkpoint its jobs
- Try condor_qedit to edit the ClassAds of jobs on the Condor queue
- Try condor_checkpoint
- Try condor_config_val to manually control preemptions and checkpoints
- Make sure SUSPEND and PREEMPT (with checkpointing) work.
- It turns out that even though a slot listed in condor_status may say it has 1.5 GB of RAM, a job's RAM usage is not limited to this number. Instead, this number seems to come into play only during the negotiation phase to figure out where the job will run, and only if the user who submitted the job bothered to mention the amount of RAM their job would need in the Requirements variable and possibly the ImageSize variable (which is more like the checkpoint file's size).
- Fixed RANK expression error. RANK refers to the machine's preferences for jobs that run on it, so the Owner variable is applicable since it is defined in the job's ClassAd. DEFAULT_RANK, on the other hand, refers to how a job ranks the machines it wants to run on, so the SlotID, KeyboardIdle, and LoadAvg variables are applicable since they are defined in the machine's ClassAd.
- Got the standard universe to work and it seems to work well with programs that were compiled with condor_compile. Thus, checkpointing also works.
- Created the Checkpointing Programs page for instruction on how to use the condor_compile command. — Garrett Koller 2011/08/05 17:34
- DEFAULT_RANK is now configured to use the “last” CPUs before using the “first” CPUs on a machine, since “first” CPUs are used by the user first.
- Got the vanilla universe to work again. Before, programs in the vanilla universe would run but then hang after transferring their output.
- The ShadowLog mentioned a bunch of “SetEffectiveOwner(kollerg) failed with errno=13: Permission denied.” errors. I enabled full debugging for the Shadow daemon, but couldn't find the source of the error.
- The SchedLog mentioned a bunch of “OwnerCheck(condor_pool) failed in SetAttribute for job 41.0” errors. I enabled full debugging for the Scheduler daemon, but no luck there either.
- Finally, I decided to reset all of the configuration variables in condor_config_global back to their default values and change them back one by one to what they were to see where the error came from. I discovered that the error came from Condor preferring (incomplete) password authentication over what should have been (complete) username authentication with FS and FS_REMOTE. I then simply fixed the authentication to try full username authentication first in order to authenticate processes based on their user ID instead of their oh-I-happen-to-have-a-password-but-you-can't-verify-who-I-am-specifically-ness. For more info, see Authentication in Condor.
- Finished the Authentication in Condor page. Explained the downsides of password-based authentication (→ “condor_pool@cs.wlu.edu”). — Garrett Koller 2011/08/09 16:04
- FINALLY got MATLAB to run in the vanilla universe. I came across two major problems that I had to figure out: — Garrett Koller 2011/08/09 17:17
- The value of the Arguments variable in the submit file must be surrounded by double quotes. (Quotes within that must be escaped with a backslash.) For executing a top level function from the command line using the argument -r <MATLAB_Command>, “<MATLAB_Command>” must be surrounded by single quotes (they don't need to be escaped). So, to run the HellowOrld() function and exit, the Arguments variable in the submit file must be as follows:
Arguments = "-nodisplay -r 'HellowOrld(); exit;'"
- In order to transfer a MATLAB function file (*.m) with the submission, I had to turn on file transferring. This inadvertently also transfers the MATLAB executable, even if you give the whole pathname of the MATLAB executable (/usr/local/bin/matlab). MATLAB doesn't like not running from deep within the bowels of the /usr/local/ directory, so to tell Condor to transfer input and output files normally but not transfer the executable (therefore assuming that it's present on the execute machine), the following variable must be set in the submit file:
transfer_executable = false
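Putting the two fixes together, a small Python helper could generate the submit description for a MATLAB job. This is a rough sketch, not the actual condor.py code: the function name `matlab_submit_description` is made up, the file-transfer settings (`should_transfer_files`, `when_to_transfer_output`) are one plausible way of "turning on file transferring", and the helper assumes the MATLAB command itself contains no quote characters.

```python
def matlab_submit_description(matlab_command, input_files):
    """Return the text of a vanilla-universe submit file that runs MATLAB."""
    lines = [
        "Universe = vanilla",
        "Executable = /usr/local/bin/matlab",
        # Whole value in double quotes; the MATLAB command in single quotes.
        "Arguments = \"-nodisplay -r '{}'\"".format(matlab_command),
        # Transfer the .m files, but leave the MATLAB executable where it is.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "transfer_input_files = " + ", ".join(input_files),
        "transfer_executable = false",
        "Queue",
    ]
    return "\n".join(lines) + "\n"


submit_text = matlab_submit_description("HellowOrld(); exit;", ["HellowOrld.m"])
print(submit_text)
```

The printed text could be saved to a file and passed to condor_submit; output, error, and log file settings are left out here to keep the sketch short.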
- Configured john to be a CondorView Server (Condor historical statistics web application)
- Made
condor.cs.wlu.edu/index.html
redirect to the Condor Wiki homepage while still allowing access to CondorView. - Added link to CondorView within Wiki.
- Made a mail mapping file (a pickled Python dictionary) to map CS Unix usernames to W&L email addresses (@mail.wlu.edu). Used this in the Python script to add the email address to the job submission automatically unless the user specifies another email address. — Garrett Koller 2011/08/16 16:59
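The mail map idea can be sketched as follows. The function names, the file name `mailmap.pkl`, and the `jdoe` entry are all made up for illustration; the real script and file layout may differ.

```python
import os
import pickle
import tempfile


def save_mail_map(path, mail_map):
    """Pickle the username-to-address dictionary to the given file."""
    with open(path, "wb") as handle:
        pickle.dump(mail_map, handle)


def load_mail_map(path):
    """Load the pickled username-to-address dictionary."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def notify_address(username, mail_map, override=None):
    """Return the user's own choice if given, else the mapped address (or None)."""
    if override:
        return override
    return mail_map.get(username)


# Made-up entry for illustration only.
example_map = {"jdoe": "jdoe@mail.wlu.edu"}
map_path = os.path.join(tempfile.mkdtemp(), "mailmap.pkl")
save_mail_map(map_path, example_map)
loaded = load_mail_map(map_path)
print(notify_address("jdoe", loaded))
print(notify_address("jdoe", loaded, override="someone@example.com"))
```

A pickled file should only be loaded from a location that ordinary users cannot write to, since unpickling untrusted data can execute arbitrary code.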
- Instead of lying to Condor by claiming that each computer has half as many CPUs as it actually does (to account for hyperthreaded CPUs), I set COUNT_HYPERTHREAD_CPUS = False in the global configuration to fix this problem. Now, we have better response for the interactive user while also allowing a job running on a slot to run faster since it is less likely to be competing with another Condor job on a complementary hyperthread on the same CPU. — Garrett Koller 2011/08/17 11:33
- Implemented Job.status(), Job.poll(), and Job.wait() in the condor Python module to let the user know if and when their jobs are done. — Garrett Koller 2011/08/18 19:10
- Enabled dynamic slot allocation so that jobs are given only the resources they need. — Garrett Koller 2011/08/19 16:07
- Redefined NonCondorLoadAvg in global config file to include Cpus (LoadAvg - CondorLoadAvg / Cpus = NonCondorLoadAvg), assuming that everything that uses NonCondorLoadAvg has access to the Cpus Startd variable. This variable is used in the START expression so that a computer with a lot of cores that are each a tiny bit busy will still be able to accept jobs. — Garrett Koller 2011/08/27 22:48
- Configured and enabled Backfill jobs — Garrett Koller 2011/09/12 17:03
- In install script, ask to install and enable backfill jobs. If so, install boinc-client and add ENABLE_BACKFILL = TRUE to local (host-specific) configuration file.
- Added yum install boinc-client (newer version, x86_64 architecture) to the install script.
- Added setRAM(), setCPUNum(), and setDisk() to condor.py to allow jobs to request the resources they need. If the user doesn't specify, it is assumed that each job needs 1 CPU, 1 GB of RAM, and 32 MB of disk space. — Garrett Koller 2011/09/13 17:45
- Added more comments for the sake of documentation. This documentation can be accessed by the user with the Python help() function, such as by running help(Condor.Job). In Python, the first line of a docstring serves as the text in the pop-up tooltip that appears in IDLE when a user starts typing in the arguments of the function. I made that text a quick snippet of each function's syntax. — Garrett Koller 2011/09/15 17:36
- Created Python script to edit the email mapping (pickled) file. Added it to the condor-py-scripting Google Code project. — Garrett Koller 2011/09/15 17:36
- Finalize R. E. Lee Research — Garrett Koller 2011/10/01 13:26
- Print poster
- Downloaded condor-7.6.4-x86_64_rhap_6.1-updated-stripped source and installed it into a new release directory. — Garrett Koller 2011/12/01 12:09