Condor TODO List

Note

DELETEME

:!: Note: This page is deprecated. The new TODO list is located on this Tasks To-Do List. :!:

Pending

  • :!: Configure firewalls to open port range 9600-9700 on all Condor computers.
  • Install Monit to monitor system services and processes to ensure that they are running at all times. Configure it so that if Condor crashes, Monit will automagically restart it.
  • :!: Update links to point to newer Condor v. 7.6.4 release directory instead of old v. 7.6.1.
  • :!: Set up john to be an email server so that condor-admin@condor.cs.wlu.edu can receive mail (and just forward it to me).
    • Configure mail rerouting so that outgoing mail will be rerouted to the sendmail service on terras. That way, the sendmail service doesn't have to run all the time.
    • Set up basic email server for condor@condor.cs.wlu.edu to send Condor mail and condor-admin@condor.cs.wlu.edu to receive mail (and forward to me).
    • Change public email address to be condor-admin@condor.cs.wlu.edu in mail map file, wiki homepage, Condor configuration file, and installation script.
  • Get parallel universe to work
    • Install MPI on all of the nodes?
  • Get Java universe to work
  • Enable root squashing (rather than “no root squashing”) on the NAS to prevent local root users from modifying files.
  • Make Condor's slot allocation “nicer”.
    • Fix the RANK function so that jobs are allocated in a breadth-first manner instead of a depth-first manner. Evenly distribute the work between machines. Ex: 4 submitted jobs on 4 machines with 2 slots each should run with one job on each of the 4 machines instead of all the jobs running on the first two machines and none on the second two machines. Use the TotalLoadAvg variable in the RANK function to take the total CPU usage into account.
  • :!: Implement DAGMan to allow Lee to introduce ordering within the job submissions.
  • :!: Create custom condor.py script to help users easily submit their jobs
    • Remove “local” argument in the Shell object __init__ method. (See TODO there.)
    • Use kwargs or the command module to make a queueMany(argsLst, outLst, inLst, logLst, …) method in the Condor module with a variable number of arguments.
    • Start an interactive session after Condor.submit() is run so that the user can (manually or automatically) poll for the current status of their jobs, remove the jobs, etc.
  • List and answer questions asked in the beginning of the installation section of the Condor manual. Add short reasons why you chose this and link any install pages that offer more detail. Put this in the “Installation homepage” for the installation section (it's too empty anyway).
  • Make “How to” page about how to configure the Apache web server and install CondorView onto it.
  • Make “How to” pages for how to submit jobs.
    • A bunch of MATLAB jobs from a list of strings.
  • Change babbage's configuration variables so that mouse and keyboard activity more readily boots a running job. Then, test the SUSPEND, VACATE, and KILL settings.
  • SSH activity on a machine resets KeyboardIdle but not ConsoleIdle. Actual keyboard activity resets both. Fix this so that SSH activity resets ConsoleIdle only and keyboard activity (and mouse activity) resets KeyboardIdle only.
  • Create “MATLAB” requirement on all of the machines with MATLAB (local config file) to designate which machines have MATLAB. See Section 2.5.2 “About Requirements and Rank” (pg. 21) in the Condor documentation.
    • Limit the number of MATLAB jobs to a certain number (based on number of MATLAB licenses we have).
  • Install python and python3 from source with condor_compile (compile as python3condor and idle3condor, for example) to be able to checkpoint any Python program (if the user specifies python3condor as the executable).
  • Install Octave (free, open-source MATLAB look-alike) from source with condor_compile to be able to checkpoint basic (portable) MATLAB code.
  • Install and configure MPICH (add secret passphrase to that one file?)
  • :!: Configure condor_cod “Computing on Demand”, such as to run full graphical MATLAB in parallel
    • Configure MATLAB on The Stable to use Condor
  • Run and configure condor_credd for secure user credentials storage (for Windows machines only?).
  • Add Windows VMs to Condor workers (Install xen, since in yum? but VMware more user friendly)
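
The breadth-first goal in the slot allocation item above can be illustrated with a toy model. This sketch is not Condor's negotiator and uses no Condor API: the function name place_jobs, the host names, and the use of a running-job count as a stand-in for TotalLoadAvg are all assumptions made for illustration. It only shows why preferring the least-loaded machine spreads 4 jobs across 4 two-slot machines.

```python
def place_jobs(num_jobs, machines, slots_per_machine, breadth_first=True):
    """Toy placement model (hypothetical, not Condor code).

    Returns a dict mapping machine name -> number of jobs placed on it.
    """
    load = {name: 0 for name in machines}
    for _ in range(num_jobs):
        # Only machines with a free slot can accept another job.
        free = [m for m in machines if load[m] < slots_per_machine]
        if not free:
            break  # pool is full; the remaining jobs would stay idle
        if breadth_first:
            # Prefer the machine with the fewest running jobs
            # (a stand-in for ranking on a low TotalLoadAvg).
            target = min(free, key=lambda m: load[m])
        else:
            # Depth-first: fill the first machine before moving on.
            target = free[0]
        load[target] += 1
    return load


machines = ["m1", "m2", "m3", "m4"]  # hypothetical host names
print(place_jobs(4, machines, 2, breadth_first=True))
print(place_jobs(4, machines, 2, breadth_first=False))
```

The first call places one job on every machine; the second packs the jobs onto the first two machines, which is the behaviour the TODO item wants to avoid.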

News

Items listed here are completed items from the list above. Items should be listed in earliest-to-latest order (more or less) and appended with the signature of the user who completed the task. :!: Note: Cross out unimportant items.

  • Put Global Config file in /mnt/config/hosts/condor_config_global
  • Add condor.cs.wlu.edu to DNS to refer to fred.cs.wlu.edu.
  • Change some references from fred.cs.wlu.edu to condor.cs.wlu.edu? (Undone)
  • Made two types of local config files: role (master/worker) and host-specific. You can specify more than one local configuration file in LOCAL_CONFIG_FILE (comma separated?). Use ~/condor/Settings Tree.mm for reference.
  • Add AddHost.sh script in /mnt/config/hosts to create directories for a new host.
  • Reserve a UID and GID from an invalid condor user on terras. Make the local condor users use that dummy UID and GID. Compare /etc/passwd from all pool machines and from terras to find a suitable UID and GID for condor. All machines should use the same CONDOR_IDS for the sake of file permission management on the NAS. Used really big numbers instead of actually reserving UID and GID.
  • Set correct CONDOR_IDS config variable on all machines, whether they be globally or locally defined.
  • Remove /etc/condor directories.
  • Undefine the CONDOR_CONFIG environment variable for all members of the Condor pool. Are global environment variables a bad idea?
  • Add a symbolic link at /var/lib/condor/condor_config to link to the global configuration file on the NAS. This location is considered to be the home folder of the condor users.
  • Put /etc/init.d/condor file on the NAS? Networking should be started by the time it is run.
  • Implement an authentication protocol as described in Section 3.6.3 “Authentication” in the Condor documentation (pg. 323).
  • Configured user-based authorization (instead of host-based) as described in Section 3.6.7 “Authorization” of the Condor documentation (pg. 337).
    • According to the Condor documentation, the “Map File” of the user based authorization allows us to map various authorization techniques to canonical usernames and hostnames.
    • Fixed an error in the MasterLog of carl and fred by implementing user-based authentication instead of host-wide authentication. Also added password authentication to enhance the security of the system. The error was:
      MasterLog
      PERMISSION DENIED to unauthenticated@unmapped from host 137.113.118.64 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 137.113.118.64,carl.cs.wlu.edu,carl
  • Fixed “no network route” error in NegotiatorLog:
    NegotiatorLog
    07/21/11 16:25:21 attempt to connect to <137.113.118.65:52652> failed: No route to host (connect errno = 113).
    07/21/11 16:25:21 ERROR: SECMAN:2004:Failed to create security session to <137.113.118.65:52652> with TCP.
    |SECMAN:2003:TCP connection to <137.113.118.65:52652> failed.
    07/21/11 16:25:21       Failed to initiate socket to send MATCH_INFO to slot1@fred.cs.wlu.edu

    Opened port range on firewall on pool member machines for Condor to use.

  • Undid TestingMode changes to global configuration file (look for TODO's).
  • Moved KeyboardIdle condition to the RANK configuration. That way, the START condition won't depend on keyboard idle time but a longer idle time will make jobs more likely to run on that slot.
  • Disabled host-wide ALLOW_CONFIG access to john.cs.wlu.edu. I did this only for testing purposes, but it was a rather large security hole to allow anyone who could log into john to change the configuration settings. Before this, the condor_config_val program didn't want to authenticate no matter what I did and so always presented itself as “unauthenticated@unmapped” to the condor_master. Now that I changed the security settings (SEC_*_AUTHENTICATION), the condor_config_val program now wants to authenticate and properly presents itself as “koller@cs.wlu.edu”.
  • Added babbage.cs.wlu.edu to the Condor pool by installing Condor on it with the wlu-cs-condor-install.sh script. Because of this, I was able to test the script for a clean-install and fix some bugs that I noticed. The script didn't create the symbolic link for the shared configuration file, for example, so I fixed that. The script also needed to create the /var/run/condor directory for the boot script to store the pidfile there, so I added it to the script. — Garrett Koller 2011/07/26 12:09
  • Put (or link?) all install- and configuration-related scripts in a centralized folder: /mnt/config/scripts/. — Garrett Koller 2011/07/27 15:30
    • Modified the install script in /mnt/config/scripts/ to prompt for the operating system and processor architecture and then pick the appropriate release directory to install from. — Garrett Koller 2011/07/27 15:31
  • Moved the AddHost.sh script to the /mnt/config/scripts/ folder. Try using the uname -p command to just make the script universal and centralized. Garrett Koller 2011/07/27 15:30
  • Made an uninstall script called wlu-cs-condor-uninstall.sh to undo everything done by wlu-cs-condor-install.sh. Garrett Koller 2011/07/27 15:31
  • Changed the condor UID and GID to 64. Since 64 is in the system ID range, it will not be assigned to normal users. Furthermore, according to the Red Hat Enterprise Linux 6 Deployment Guide, 64 is the “well-known” ID used for the condor user and group. In order to change the ownership of all of the old condor user and group files, I ran the following script on all of the machines in the Condor pool:
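    ChangeCondorID.sh
    # Give every file owned by the old condor UID (1344) to the condor user.
    # find -exec passes file names safely, including names that contain spaces.
    sudo find / -uid 1344 -exec chown -v condor {} +
    # Give every file in the old condor GID (1610) to the condor group.
    sudo find / -gid 1610 -exec chown -v :condor {} +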

    Garrett Koller 2011/07/28 15:03

  • Modified install script and uninstall script to automatically add (and remove, respectively) the hostname to the PoolMembers variable in the global configuration file. I did this with a few simple sed commands. — Garrett Koller 2011/07/29 14:41
  • :!: Run checkPrimes program on Condor and test checkpointing functionality.
    • Force a machine (or slot) to checkpoint its jobs
      • Try condor_qedit to edit the classads of jobs on the Condor queue
      • Try condor_checkpoint
      • Try condor_config_val to manually control preemptions and checkpoints
    • Make sure SUSPEND and PREEMPT (with checkpointing) work.
  • It turns out that even though a slot listed in condor_status may say it has 1.5 Gigs, a job's RAM usage is not limited to this number. Instead, this number seems to come into play only during the negotiation phase to figure out where the job will run and only if the user who submitted the job bothered to mention the amount of RAM their job would need in the Requirements variable and possibly the ImageSize variable (which is more like the checkpoint file's size).
  • Fixed RANK expression error. RANK refers to the machine's preferences for jobs that run on it, so the Owner variable is applicable since it is defined in the job's ClassAd. DEFAULT_RANK, on the other hand, refers to how a job ranks the machines it wants to run on, so the SlotID, KeyboardIdle, and LoadAvg variables are applicable since they are defined in the machine's ClassAd.
  • Got the standard universe to work and it seems to work well with programs that were compiled with condor_compile. Thus, checkpointing also works.
  • Created the Checkpointing Programs page for instruction on how to use the condor_compile command. — Garrett Koller 2011/08/05 17:34
  • DEFAULT_RANK is now configured to use the “last” CPUs before using the “first” CPUs on a machine, since “first” CPUs are used by the user first.
  • Got the vanilla universe to work again. Before, programs in the vanilla universe would run but then hang after transferring their output.
    • The ShadowLog mentioned a bunch of “SetEffectiveOwner(kollerg) failed with errno=13: Permission denied.” messages. I enabled full debugging for the Shadow daemon, but couldn't find the source of the error.
    • The SchedLog mentioned a bunch of “OwnerCheck(condor_pool) failed in SetAttribute for job 41.0”. I enabled full debugging for the Scheduler daemon, but no luck there either.
    • Finally, I decided to reset all of the configuration variables in condor_config_global back to their default values and change them back one-by-one to what they were to see where the error came from. I discovered that the error came from incomplete authentication: Condor was preferring (incomplete) password authentication over what should have been (complete) username authentication with FS and FS_REMOTE. I then simply fixed the authentication to try full username authentication first in order to authenticate processes based on their user ID instead of their oh-I-happen-to-have-a-password-but-you-can't-verify-who-I-am-specifically-ness. For more info, see Authentication in Condor.
  • Finished the Authentication in Condor page. Explained the downsides of password-based authentication (→ “condor_pool@cs.wlu.edu”). — Garrett Koller 2011/08/09 16:04
  • FINALLY got MATLAB to run in the vanilla universe. I came across two major problems that I had to figure out: — Garrett Koller 2011/08/09 17:17
    1. The value of the Arguments variable in the submit file must be surrounded by double quotes. (Quotes within that must be escaped with a backslash.) For executing a top level function from the command line using the argument -r <MATLAB_Command>, “<MATLAB_Command>” must be surrounded by single quotes (they don't need to be escaped). So, to run the HellowOrld() function and exit, the Arguments variable in the submit file must be as follows:
      Arguments = "-nodisplay -r 'HellowOrld(); exit;'"
    2. In order to transfer a MATLAB function file (*.m) with the submission, I had to turn on file transferring. This inadvertently also transfers the MATLAB executable, even if you give the whole pathname of the MATLAB executable (/usr/local/bin/matlab). MATLAB doesn't like not running from deep within the bowels of the /usr/local/ directory, so to tell Condor to transfer input and output files normally but not transfer the executable (therefore assuming that it's present on the execute machine), the following variable must be set in the submit file:
      transfer_executable = false
  • Configured john to be a CondorView Server (Condor historical statistics web application)
    • Made condor.cs.wlu.edu/index.html redirect to the Condor Wiki homepage while still allowing access to CondorView.
    • Added link to CondorView within Wiki.
  • Made a mail mapping file (a pickled Python dictionary) to map CS Unix usernames to W&L email addresses (@mail.wlu.edu). Used this in the Python script to add the email address to the job submission automatically unless the user specifies another email address. — Garrett Koller 2011/08/16 16:59
  • Instead of lying to Condor by telling it that each computer has half as many CPUs as it actually has (to account for hyperthreaded CPUs), I set COUNT_HYPERTHREAD_CPUS = False in the global configuration. Now, we have better response for the interactive user while also allowing a job running on a slot to run faster since it is less likely to be competing with another Condor job on a complementary hyperthread on the same CPU. — Garrett Koller 2011/08/17 11:33
  • Implemented Job.status(), Job.poll(), and Job.wait() in the condor Python module to let the user know if and when their jobs are done. — Garrett Koller 2011/08/18 19:10
  • Enabled dynamic slot allocation so that jobs are given only the resources they need. — Garrett Koller 2011/08/19 16:07
  • Redefined NonCondorLoadAvg in the global config file to include Cpus (NonCondorLoadAvg = (LoadAvg - CondorLoadAvg) / Cpus), assuming that everything that uses NonCondorLoadAvg has access to the Cpus Startd variable. This variable is used in the START expression so that a computer with a lot of cores that are each a tiny bit busy will still be able to accept jobs. — Garrett Koller 2011/08/27 22:48
  • Configured and enabled Backfill jobs — Garrett Koller 2011/09/12 17:03
    • In install script, ask to install and enable backfill jobs. If so, install boinc-client and add ENABLE_BACKFILL = TRUE to local (host-specific) configuration file.
    • Added yum install boinc-client (newer version, x86_64 architecture) to the install script.
  • Added setRAM(), setCPUNum(), and setDisk() to condor.py to allow jobs to request their resources. If the user doesn't specify, it is assumed that each job needs 1 CPU, 1 Gig of RAM, and 32 MB of disk space. — Garrett Koller 2011/09/13 17:45
  • Added more comments for the sake of documentation. This documentation can be accessed by the user with the Python help() function, such as by running help(Condor.Job). The first line of each docstring serves as the text of the pop-up tooltip that appears in Idle when a user starts typing in the arguments of the function. I made that text a quick snippet of each function's syntax. — Garrett Koller 2011/09/15 17:36
  • Created Python script to edit the email mapping (pickled) file. Added it to the condor-py-scripting Google Code project. — Garrett Koller 2011/09/15 17:36
  • Finalize R. E. Lee Research — Garrett Koller 2011/10/01 13:26
    • Print poster
  • Downloaded condor-7.6.4-x86_64_rhap_6.1-updated-stripped source and installed it into a new release directory. — Garrett Koller 2011/12/01 12:09
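
The two MATLAB findings recorded above (the quoting of the Arguments value and transfer_executable = false) can be sketched as a small helper that writes the submit description. This is a minimal sketch, not the actual condor.py code: the function name matlab_submit_description and the HellowOrld.m file name are hypothetical. It also assumes that "turning on file transferring" meant the should_transfer_files, when_to_transfer_output, and transfer_input_files submit commands, which the original entry does not spell out.

```python
def matlab_submit_description(function_call, input_files,
                              executable="/usr/local/bin/matlab"):
    """Build the text of a vanilla-universe submit file for a MATLAB job.

    Hypothetical helper for illustration; not part of condor.py.
    """
    # The -r command sits in single quotes inside the double-quoted
    # Arguments value, so it reaches MATLAB as a single argument.
    arguments = "-nodisplay -r '{0}; exit;'".format(function_call)
    lines = [
        "Universe = vanilla",
        "Executable = " + executable,
        'Arguments = "{0}"'.format(arguments),
        # Assumed file-transfer settings (not named in the original entry).
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "transfer_input_files = " + ", ".join(input_files),
        # Use the MATLAB already installed on the execute machine.
        "transfer_executable = false",
        "Queue",
    ]
    return "\n".join(lines) + "\n"


# HellowOrld.m is a hypothetical file holding the HellowOrld() function.
print(matlab_submit_description("HellowOrld()", ["HellowOrld.m"]))
```

The printed text could be saved to a file and passed to condor_submit. Keeping transfer_executable = false next to the transfer settings makes it harder to enable file transfer and forget the executable exception.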
condor/installation/todo_list.txt · Last modified: 2012/01/27 17:45 by garrettheath4
CC Attribution-Noncommercial-Share Alike 4.0 International