Network Installation of Condor
This page describes the procedure for installing Condor in a network-shared location. Installing Condor on the network allows for centralized configuration and convenient upgrades. The shared location should be on a high-availability file server, preferably a high-bandwidth RAID-backed NAS, that is accessible from every member of the Condor pool.
Our pool uses a dedicated NAS server with a 2.7TB RAID 5+1 drive configuration, which provides faster access times (especially for reads) than ordinary local hard drive I/O.
Add Local condor User
In order for the daemons to run correctly and for permissions to be properly set, a local condor user must be present on every member of the Condor pool, with the following IDs:
condor UID = 64
condor GID = 64
First, check whether the condor user already exists on the machine by running:
grep '^condor:' /etc/passwd
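A matching entry looks like the sample line below. This sketch (using a hard-coded sample entry rather than the real /etc/passwd, so it runs anywhere) shows how the UID and GID fields can be pulled out to verify they are both 64:

```shell
# Hypothetical sample entry; on a real host, take this line from /etc/passwd
line='condor:x:64:64:Owner of Condor Daemons:/var/lib/condor:/sbin/nologin'
uid=$(echo "$line" | cut -d: -f3)   # third colon-separated field is the UID
gid=$(echo "$line" | cut -d: -f4)   # fourth field is the GID
echo "$uid $gid"                    # prints: 64 64
```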
If you get a match, first reset the user's settings in case it wasn't created correctly:
sudo groupmod -g 64 condor
sudo usermod -c "Owner of Condor Daemons" -d "/var/lib/condor" -m -u 64 -g condor -s "/sbin/nologin" -L condor
If you get a message that says that the directory
/var/lib/condor already exists, run this command next:
sudo chown -R condor:condor /var/lib/condor
If you do not get a match, you need to add the user manually. To do this, run:
sudo groupadd -g 64 condor
sudo useradd -c "Owner of Condor Daemons" -d "/var/lib/condor" -m -u 64 -g condor -s "/sbin/nologin" condor
sudo usermod -L condor
Just to be sure, run
ls -al /var/lib/condor
and verify that the entry "." is owned by condor and belongs to the condor group. If not, you probably have a conflicting UID or GID and will have to set it manually. Choose one that is not in use by the local user system or by the network, and then set the CONDOR_IDS variable in that individual host's Condor local configuration file.
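For example, if UID/GID 64 were already taken and the condor user had to be created with UID/GID 501 (a hypothetical value), that host's local configuration file would need a line like:

```
# Local configuration sketch; 501.501 is a hypothetical uid.gid pair
CONDOR_IDS = 501.501
```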
The NAS is used to store all of the program files that Condor needs to run. Installing these onto the NAS only needs to happen once; to (re)build the installation onto the tesla.cs.wlu.edu NAS, run these commands in a terminal:
cd /mnt/config/src/fedora64
sudo ./condor_configure --type=manager,submit,execute --central-manager=john.cs.wlu.edu --local-dir=/mnt/config/hosts/_default --install-dir=/mnt/config/release/x86_64_rhap_5 --owner=condor --install --verbose
Set Machine Variables
When the condor_master program starts, the first thing it does is look for the global configuration file.
The problem with putting so much of Condor on the NAS is that it introduces a lot of NFS traffic onto the network, especially while Condor jobs are running. If user executables were also stored centrally on the NAS, every computer would be almost constantly reading from the NAS as the executables were opened and run.
The W&L Computer Science Department Systems Administrator, Steve Goryl, had a similar problem when all of the Linux applications on the lab computers were actually centrally located and run from the central CS department server. This proved to produce higher-than-expected traffic on the network and the programs became laggy. Installing the applications locally on the hard drives of the lab computers proved to be more of a pain administratively but provided much better overall performance.
We can still get good performance while keeping Condor centrally located: keep Condor's binaries and all of the configuration files on the NAS, while storing each currently-running job's user executable locally on the executing machine's hard drive. Condor's binaries stay on the NAS for the sake of easy upgrades, and job binaries are stored on and run from the execute machines' local hard drives.
In order to do this, we need to create, on every machine, certain directories owned by the (local) condor user. These directories serve as the playground for Condor jobs while they are executing on a machine. Create the directories, then tell Condor where they are and what to do with them:
sudo mkdir /var/lib/condor/execute
sudo chown -R condor:condor /var/lib/condor/execute
sudo chmod -R 755 /var/lib/condor/execute
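Pointing Condor at this directory is done in configuration. A sketch of the relevant line (EXECUTE is Condor's standard variable for the job scratch directory; this assumes it is not already set elsewhere in your configuration):

```
# Point Condor's job-execution scratch space at the local disk
EXECUTE = /var/lib/condor/execute
```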
As specified in our Condor system's global configuration file, access to Condor is restricted to certain machines and usernames. Whenever Condor receives a request, it first checks to see if the requester is allowed to make such a request. Unfortunately, the requesting machine can lie about who it is and thereby "spoof" Condor into thinking the request is coming from a valid source. To help prevent this, Condor uses basic authentication to protect itself from computers disguised as valid members of its pool. This authentication takes the form of an encrypted password. When Condor starts, it reads the configuration files to figure out where the password is stored. As listed in the global configuration file as the
SEC_PASSWORD_FILE configuration variable, the password is stored as
/var/lib/condor/pool_password with root-only access. In order for machines to be added to the Condor pool, this file must be manually copied from an existing member of the pool to the new member. Once copied, this file must be owned by
root and have read and write access for the owner with all other permissions disabled (mode 600).
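The lock-down step might look like the following sketch. It uses a throwaway temp file in place of /var/lib/condor/pool_password, since changing the real file's ownership requires root:

```shell
# Stand-in for /var/lib/condor/pool_password; the real file must be owned by root
f=$(mktemp)
chmod 600 "$f"          # read/write for the owner, nothing for anyone else
stat -c '%a' "$f"       # prints: 600
rm -f "$f"
```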
Condor primarily uses port 9618 for communication between the
condor_master daemons on each of the members of the Condor pool. Because of this, the firewall of each of the members needs to have port 9618 open to accept incoming communication. Condor uses a lot of other dynamically-chosen ports for direct communication between other daemons that want to bypass the
condor_master daemon (in order to not bog down the busy
condor_master daemon, of course). If the daemons are configured to publish their port number publicly (in the filesystem), the daemons should be allowed to directly communicate with each other.
In order to do this, a range of ports needs to be opened so that all of the Condor daemons can freely communicate with each other while still using dynamically-allocated ports. Specifically, the LOWPORT and HIGHPORT configuration variables in the global configuration file define the range of ports Condor is allowed to use. By default and/or by convention, this range is 9600-9700. To open this port range, run the system's firewall configuration tool as root and add the 9600-9700 user-defined tcp and udp port ranges to the "Other Ports" section. Click "Apply" to finish the deed. Now, Condor daemons can freely communicate without being impeded by the firewall.
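On the Condor side, the matching port-range restriction looks like this in the global configuration file (a sketch, assuming the conventional 9600-9700 range described above):

```
# Restrict Condor's dynamically-allocated ports to the range the firewall allows
LOWPORT = 9600
HIGHPORT = 9700
```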