====== Condor TODO List ======

=== Note ===

DELETEME

:!: Note: This page is deprecated. The __new__ TODO list is located on [[condor:administration:todo|this Tasks To-Do List]]. :!:

===== Pending =====

  * :!: Configure firewalls to open port range 9600-9700 on all Condor computers.
  * Install [[http://mmonit.com/monit/|Monit]] to monitor system services and processes to ensure that they are running at all times. Configure it so that if Condor crashes, Monit will automagically restart it.
  * :!: Update links to point to the newer Condor v. 7.6.4 release directory instead of the old v. 7.6.1.
  * :!: Set up ''john'' to be an email server so that ''condor-admin@condor.cs.wlu.edu'' can receive mail (and just forward it to me).
  * Configure mail rerouting so that outgoing mail will be rerouted to the ''sendmail'' service on ''terras''. That way, the ''sendmail'' service doesn't have to run all the time.
  * Set up basic email server for ''condor@condor.cs.wlu.edu'' to send Condor mail and ''condor-admin@condor.cs.wlu.edu'' to receive mail (and forward to me).
  * Change public email address to be ''condor-admin@condor.cs.wlu.edu'' in mail map file, wiki homepage, Condor configuration file, and installation script.
  * Get parallel universe to work
    * Install MPI on all of the nodes?
  * Get Java universe to work
    * For Dr. Lambert's [[http://home.wlu.edu/~lambertk/classes/320/|CS 320 Parallel Computing]] class
  * Enable root squashing on the NAS (that is, do not export with ''no_root_squash'') to prevent local ''root'' users from modifying files.
  * Make Condor's slot allocation "nicer".
    * Fix the ''RANK'' expression so that jobs are allocated in a breadth-first manner instead of a depth-first manner. Evenly distribute the work between machines. For example, 4 submitted jobs on 4 machines with 2 slots each should run with one job on each of the 4 machines instead of all the jobs running on the first two machines and none on the other two. Use the **''TotalLoadAvg''** variable in the ''RANK'' expression to take the total CPU usage into account.
  * :!: Implement DAGMan to allow Lee to introduce ordering within the job submissions.
  * :!: Create custom ''condor.py'' script to help users easily submit their jobs
    * Remove "local" argument in the ''Shell'' object ''%%__init__%%'' method. (See TODO there.)
    * Use kwargs or the command module to make a ''queueMany(argsLst, outLst, inLst, logLst, ...)'' method in the Condor module with a variable number of arguments.
    * Start an interactive session after ''Condor.submit()'' is run so that the user can (manually or automatically) poll for the current status of their jobs, remove the jobs, etc.
  * List and answer the questions asked at the beginning of the installation section of the Condor manual. Add short reasons for each choice and link any install pages that offer more detail. Put this in the "Installation homepage" for the installation section (it's too empty anyway).
  * Make "How to" page about how to configure the Apache web server and install CondorView onto it.
  * Make "How to" pages for how to submit jobs.
    * A bunch of MATLABs from a list of strings.
  * Change ''babbage'''s configuration variables so that mouse and keyboard activity more readily boots a running job. Then, test the ''SUSPEND'', ''VACATE'', and ''KILL'' settings.
    * SSH activity on a machine resets ''KeyboardIdle'' but not ''ConsoleIdle''. Actual keyboard activity resets both. Fix this so that SSH activity resets ''ConsoleIdle'' only and keyboard activity (and mouse activity) resets ''KeyboardIdle'' only.
  * Create "MATLAB" requirement on all of the machines with MATLAB (local config file) to designate which machines have MATLAB. See Section 2.5.2 "About Requirements and Rank" (pg. 21) in the Condor documentation.
  * Limit the number of MATLAB jobs to a certain number (based on the number of MATLAB licenses we have).
  * Install Python and Python 3 from source with ''condor_compile'' (compile as ''python3condor'' and ''idle3condor'', for example) to be able to checkpoint any Python program (if the user specifies ''python3condor'' as the executable).
  * Install [[http://www.gnu.org/software/octave/|Octave]] (free, open-source MATLAB look-alike) from source with ''condor_compile'' to be able to checkpoint basic (portable) MATLAB code.
  * Install and configure MPICH (add secret passphrase to that one file?)
  * :!: Configure ''condor_cod'' "Computing on Demand", such as to run full graphical MATLAB in parallel
  * Configure MATLAB on The Stable to use Condor
  * Run and configure ''condor_credd'' for secure user credentials storage (for Windows machines only?).
  * Add Windows VMs to Condor workers (Install Xen, since in ''yum''? but VMware more user friendly)

===== News =====

Items listed here are completed items from the list above. Items should be listed in earliest-to-latest order (more or less) and appended with the signature of the user who completed the task.

:!: Note: Cross out unimportant items.

  * Put Global Config file in ''/mnt/config/hosts/condor_config_global''
  * Add ''condor.cs.wlu.edu'' to DNS to refer to ''fred.cs.wlu.edu''.
    * Change some references from ''fred.cs.wlu.edu'' to ''condor.cs.wlu.edu''? (Undone)
  * Made two types of local config files: role (master/worker) and host-specific. You can specify more than one local configuration file in ''LOCAL_CONFIG_FILE'' (comma separated?). Use ''~/condor/Settings Tree.mm'' for reference.
  * Add ''AddHost.sh'' script in ''/mnt/config/hosts'' to create directories for a new host.
  * Reserve a UID and GID from an invalid ''condor'' user on ''terras''. Make the local ''condor'' users use that dummy UID and GID. Compare ''/etc/passwd'' from all pool machines and from ''terras'' to find a suitable UID and GID for ''condor''. All machines should use the //same// ''CONDOR_IDS'' for the sake of file permission management on the NAS.
    * Used really big numbers instead of actually reserving a UID and GID.
  * Set correct ''CONDOR_IDS'' config variable on all machines, whether globally or locally defined.
  * Remove ''/etc/condor'' directories.
  * Undefine the ''CONDOR_CONFIG'' environment variable for all members of the Condor pool. Global environment variables a bad idea?
  * Add a symbolic link at ''/var/lib/condor/condor_config'' to link to the global configuration file on the NAS. This location is considered to be the home folder of the ''condor'' users.
  * Put ''/etc/init.d/condor'' file on the NAS? Networking should be started by the time it is run.
  * Implement an authentication protocol as described in Section 3.6.3 "Authentication" in the Condor documentation (pg. 323).
  * Configured user-based authorization (instead of host-based) as described in Section 3.6.7 "Authorization" of the Condor documentation (pg. 337).
    * According to the Condor documentation, the "Map File" of the user-based authorization allows us to map various authorization techniques to canonical usernames and hostnames.
    * Fixed an error in the MasterLog of ''carl'' and ''fred'' by implementing user-based authentication instead of host-wide authentication. Also added password authentication to enhance the security of the system. The error was:
      PERMISSION DENIED to unauthenticated@unmapped from host 137.113.118.64 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 137.113.118.64,carl.cs.wlu.edu,carl
  * Fixed "no network route" error in NegotiatorLog:
      07/21/11 16:25:21 attempt to connect to <137.113.118.65:52652> failed: No route to host (connect errno = 113).
      07/21/11 16:25:21 ERROR: SECMAN:2004:Failed to create security session to <137.113.118.65:52652> with TCP. |SECMAN:2003:TCP connection to <137.113.118.65:52652> failed.
      07/21/11 16:25:21 Failed to initiate socket to send MATCH_INFO to slot1@fred.cs.wlu.edu
    * Opened the port range on the firewalls of the pool member machines for Condor to use.
  * Undid TestingMode changes to global configuration file (look for TODOs).
  * Moved ''KeyboardIdle'' condition to the ''RANK'' configuration. That way, the ''START'' condition won't depend on keyboard idle time but a longer idle time will make jobs more likely to run on that slot.
  * Disabled host-wide ''ALLOW_CONFIG'' access to ''john.cs.wlu.edu''. I had enabled it only for testing purposes, but it was a rather large security hole to allow anyone who could log into ''john'' to change the configuration settings. Before this, the ''condor_config_val'' program didn't want to authenticate no matter what I did and so always presented itself as "unauthenticated@unmapped" to the ''condor_master''. Now that I have changed the security settings (''SEC_*_AUTHENTICATION''), the ''condor_config_val'' program wants to authenticate and properly presents itself as "koller@cs.wlu.edu".
  * Added ''babbage.cs.wlu.edu'' to the Condor pool by installing Condor on it with the ''wlu-cs-condor-install.sh'' script. Because of this, I was able to test the script for a clean install and fix some bugs that I noticed. The script didn't create the symbolic link for the shared configuration file, for example, so I fixed that. The script also needed to create the ''/var/run/condor'' directory for the boot script to store the pidfile there, so I added it to the script. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/07/26 12:09//
  * Put (or link?) all install- and configuration-related scripts in a centralized folder: ''/mnt/config/scripts/''. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/07/27 15:30//
  * Modified the install script in ''/mnt/config/scripts/'' to prompt for the operating system and processor architecture and then pick the appropriate release directory to install from.
--- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/07/27 15:31//
  * Moved the ''AddHost.sh'' script to the ''/mnt/config/scripts/'' folder. Try using the ''uname -p'' command to just make the script universal and centralized. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/07/27 15:30//
  * Made uninstall script called ''wlu-cs-condor-uninstall.sh'' to undo everything in ''wlu-cs-condor-install.sh'' --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/07/27 15:31//
  * Changed the ''condor'' UID and GID to 64. Since 64 is in the system ID range, it will not be assigned to normal users. Furthermore, 64 is "well-known" to be used for the ''condor'' user and group according to the [[http://www.linuxtopia.org/online_books/rhel6/rhel_6_deployment/rhel_6_deployment_s1-users-groups-standard-users.html|Red Hat Enterprise Linux 6 Deployment Guide]]. In order to change the ownership of all of the old ''condor'' user and group files, I ran the following script on all of the machines in the Condor pool:
      # Hand files owned by the old condor UID (1344) to the new condor user.
      # Note: the backtick loop splits on whitespace, so it assumes no spaces in file names.
      for file in `find / -uid 1344`
      do
        sudo chown -v condor "$file"
      done
      # Hand files owned by the old condor GID (1610) to the new condor group.
      for file in `find / -gid 1610`
      do
        chown -v :condor "$file"
      done
    --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/07/28 15:03//
  * Modified install script and uninstall script to automatically add (and remove, respectively) the hostname to the ''PoolMembers'' variable in the global configuration file. I did this with a few simple ''sed'' commands. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/07/29 14:41//
  * :!: Run ''checkPrimes'' program on Condor and test checkpointing functionality.
    * Force a machine (or slot) to checkpoint its jobs
    * Try ''condor_qedit'' to edit the ClassAds of jobs on the Condor queue
    * Try ''condor_checkpoint''
    * Try ''condor_config_val'' to manually control preemptions and checkpoints
    * Make sure ''SUSPEND'' and ''PREEMPT'' (with checkpointing) work.
  * It turns out that even though a slot listed in ''condor_status'' may say it has 1.5 GB of memory, a job's RAM usage is not limited to this number. Instead, this number seems to come into play only during the negotiation phase to figure out where the job will run and **only if** the user who submitted the job bothered to mention the amount of RAM their job would need in the ''Requirements'' variable and possibly the ''ImageSize'' variable (which is more like the checkpoint file's size).
  * Fixed ''RANK'' expression error. ''RANK'' refers to the machine's preferences for jobs that run on it, so the ''Owner'' variable is applicable since it is defined in the **job's** ClassAd. ''DEFAULT_RANK'', on the other hand, refers to how a job ranks the machines it wants to run on, so the ''SlotID'', ''KeyboardIdle'', and ''LoadAvg'' variables are applicable since they are defined in the **machine's** ClassAd.
  * Got the **standard** universe to work, and it seems to work well with programs that were compiled with ''condor_compile''. Thus, checkpointing also works.
  * Created the [[condor:submit:checkpointing|Checkpointing Programs]] page for instruction on how to use the ''condor_compile'' command. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/05 17:34//
  * ''DEFAULT_RANK'' is now configured to use the "last" CPUs before using the "first" CPUs on a machine, since "first" CPUs are used by the user first.
  * Got the **vanilla** universe to work again. Before, programs in the vanilla universe would run but then hang after transferring their output.
    * The ShadowLog mentioned a bunch of ''SetEffectiveOwner(kollerg) failed with errno=13: Permission denied.''. I enabled full debugging for the Shadow daemon, but couldn't find the source of the error.
    * The SchedLog mentioned a bunch of "''OwnerCheck(condor_pool) failed in SetAttribute for job 41.0''". I enabled full debugging for the Scheduler daemon, but no luck there either.
    * Finally, I decided to reset all of the configuration variables in {{:condor:installation:condor_config_global.txt|condor_config_global}} back to their default values and change them back one by one to what they were to see where the error came from. I discovered that the error had to do with incomplete authentication: Condor was preferring (incomplete) password authentication over what should have been (complete) username authentication with ''FS'' and ''FS_REMOTE''. I then simply fixed the authentication to try full username authentication first in order to authenticate processes based on their user ID instead of their oh-I-happen-to-have-a-password-but-you-can't-verify-who-I-am-specifically-ness. For more info, see [[condor:administration:authentication|Authentication in Condor]].
  * Finished the [[condor:administration:authentication|Authentication in Condor]] page. Explained the downsides of password-based authentication (-> "condor_pool@cs.wlu.edu"). --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/09 16:04//
  * FINALLY got MATLAB to run in the vanilla universe. I came across two major problems that I had to figure out: --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/09 17:17//
    - The value of the ''Arguments'' variable in the submit file must be surrounded by double quotes. (A literal double quote inside that value is written as two double quotes.) When executing a top-level function from the command line with the ''-r'' argument, the MATLAB statement passed to ''-r'' must be surrounded by single quotes (they don't need to be escaped). So, to run the ''HellowOrld()'' function and exit, the ''Arguments'' variable in the submit file must be as follows:
        Arguments = "-nodisplay -r 'HellowOrld(); exit;'"
    - In order to transfer a MATLAB function file (*.m) with the submission, I had to turn on file transferring. This inadvertently also transfers the MATLAB executable, even if you give the whole pathname of the MATLAB executable (''/usr/local/bin/matlab'').
MATLAB doesn't like running from anywhere but deep within the bowels of the ''/usr/local/'' directory, so to tell Condor to transfer input and output files normally but __not__ transfer the executable (therefore assuming that it's present on the execute machine), the following variable must be set in the submit file:
        transfer_executable = false
  * Configured ''john'' to be a CondorView Server (Condor historical statistics web application)
    * Made ''condor.cs.wlu.edu/index.html'' redirect to the Condor Wiki homepage while still allowing access to CondorView.
    * Added link to CondorView within Wiki.
  * Made a mail mapping file (a pickled Python dictionary) to map CS Unix usernames to W&L email addresses (@mail.wlu.edu). Used this in the Python script to add the email address to the job submission automatically unless the user specifies another email address. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/16 16:59//
  * Instead of lying to Condor by reporting half of each computer's actual CPU count (to account for hyperthreaded CPUs), I set ''COUNT_HYPERTHREAD_CPUS = False'' in the global configuration to fix this problem. Now, we have better response for the interactive user while also allowing a job running on a slot to run faster since it is less likely to be competing with another Condor job on a complementary hyperthread on the same CPU. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/17 11:33//
  * Implemented ''Job.status()'', ''Job.poll()'', and ''Job.wait()'' in the ''condor'' Python module to let the user know if and when their jobs are done. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/18 19:10//
  * Enabled dynamic slot allocation so that jobs are given only the resources they need.
--- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/19 16:07//
  * Redefined ''NonCondorLoadAvg'' in global config file to include ''Cpus'' (''NonCondorLoadAvg'' = (''LoadAvg'' - ''CondorLoadAvg'') **/ ''Cpus''**), assuming that everything that uses ''NonCondorLoadAvg'' has access to the ''Cpus'' Startd variable. This variable is used in the ''START'' expression so that a computer with a lot of cores that are each a tiny bit busy will still be able to accept jobs. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/08/27 22:48//
  * Configured and enabled Backfill jobs --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/09/12 17:03//
    * In install script, ask to install and enable backfill jobs. If so, install ''boinc-client'' and add ''ENABLE_BACKFILL = TRUE'' to local (host-specific) configuration file.
    * Added ''yum install boinc-client'' (newer version, x86_64 architecture) to the install script.
  * Added ''setRAM()'', ''setCPUNum()'', and ''setDisk()'' to ''condor.py'' to allow jobs to request their resources. If the user doesn't specify, it is assumed that each job needs 1 CPU, 1 GB of RAM, and 32 MB of disk space. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/09/13 17:45//
  * Added more comments for the sake of documentation. This documentation can be accessed by the user with the Python ''help()'' function, such as by running ''help(Condor.Job)''. The first line of each docstring serves as the text in the pop-up tooltip that appears in IDLE when a user starts typing in the arguments of the function. I made that text a quick snippet of each function's syntax. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/09/15 17:36//
  * Created Python script to edit the email mapping (pickled) file. Added it to the ''condor-py-scripting'' Google Code project. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/09/15 17:36//
  * Finalize R. E. Lee Research --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/10/01 13:26//
    * Print poster
  * Downloaded ''condor-7.6.4-x86_64_rhap_6.1-updated-stripped'' source and installed it into a new release directory. --- //[[kollerg14@mail.wlu.edu|Garrett Koller]] 2011/12/01 12:09//
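
The MATLAB notes in the News list above (a double-quoted ''Arguments'' value with the ''-r'' statement in single quotes, file transfer turned on, and ''transfer_executable = false'') can be pulled together in one place. The following is only a minimal Python sketch, not the actual ''condor.py'' code: the function name ''matlab_submit_description'' and the output, error, and log file names are hypothetical, and it assumes that file transfer was turned on with the ''should_transfer_files'', ''when_to_transfer_output'', and ''transfer_input_files'' submit commands.

```python
def matlab_submit_description(statement, input_files,
                              executable="/usr/local/bin/matlab"):
    """Build the text of a vanilla-universe submit file that runs one MATLAB statement.

    statement   -- MATLAB code to pass to -r, e.g. "HellowOrld(); exit;"
    input_files -- list of .m files to send along with the job
    """
    # Quotes inside the statement would need extra escaping; this sketch refuses them.
    if "'" in statement or '"' in statement:
        raise ValueError("statement must not contain quote characters")

    lines = [
        "universe = vanilla",
        "executable = " + executable,
        # Whole value in double quotes, the -r statement in single quotes.
        "Arguments = \"-nodisplay -r '" + statement + "'\"",
        # Turn on file transfer so the .m files reach the execute machine ...
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "transfer_input_files = " + ", ".join(input_files),
        # ... but leave the MATLAB executable where it is already installed.
        "transfer_executable = false",
        # Hypothetical file names, chosen only for this example.
        "output = matlab.out",
        "error = matlab.err",
        "log = matlab.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(matlab_submit_description("HellowOrld(); exit;", ["HellowOrld.m"]))
```

Writing the returned text to a file and handing it to ''condor_submit'' would be the next step; that part is left out because it depends on the local setup.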
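
To see why dividing by ''Cpus'' matters in the ''NonCondorLoadAvg'' item above, here is a small Python sketch of the arithmetic. It is an illustration only, not the Condor configuration itself: the function names are hypothetical, and the ''BACKGROUND_LOAD'' threshold of 0.3 is an assumed value standing in for whatever the ''START'' expression actually compares against.

```python
# Assumed threshold for "this machine is idle enough"; the real START
# expression in the global configuration may use a different value.
BACKGROUND_LOAD = 0.3


def non_condor_load_avg(load_avg, condor_load_avg, cpus):
    """Load not caused by Condor jobs, averaged over the machine's CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return (load_avg - condor_load_avg) / float(cpus)


def machine_accepts_jobs(load_avg, condor_load_avg, cpus):
    """True if the per-CPU non-Condor load is at or below the assumed threshold."""
    return non_condor_load_avg(load_avg, condor_load_avg, cpus) <= BACKGROUND_LOAD


if __name__ == "__main__":
    # Eight cores that are each a tiny bit busy: total load 1.6, none from Condor.
    print(machine_accepts_jobs(1.6, 0.0, 8))  # per-CPU load is 0.2
    # The same total load on a single core is far above the threshold.
    print(machine_accepts_jobs(1.6, 0.0, 1))
```

Without the division, the eight-core machine would be compared on its total load of 1.6 and would be treated as busy even though each core is nearly idle.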