How to overcome issues when configuring Condor
Condor records error messages in the file local.***/log/MasterLog. Check this file first to find errors.
1. Not binding to the central manager.
7/9 02:26:42 Failed to start non-blocking update to <159.226.3.188:9618>.
7/9 02:31:42 ERROR: SECMAN:2004:Failed to start a session to <159.226.3.188:9618> with TCP|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using PASSWORD
You need to copy the central manager's file condor/condor/local.c2401/pool_password to the corresponding local directory on the worker node (the path that SEC_PASSWORD_FILE points to). The relevant part of the config file looks like the following:
# Security setup to use pool password
SEC_PASSWORD_FILE = /opt/app/condor/condor/local.$(HOSTNAME)/pool_password
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
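A minimal sketch of how one might confirm that the worker node's copy of pool_password really matches the central manager's, assuming both files can be read from one shell (for example over NFS, or after copying the manager's file to a temporary path). The function name same_pool_password is hypothetical and not part of Condor.

```shell
# Hypothetical helper (not a Condor tool): succeed only if the two
# pool_password files are readable and byte-for-byte identical.
same_pool_password() {
    # $1: reference copy taken from the central manager
    # $2: the copy installed on the worker node
    [ -r "$1" ] && [ -r "$2" ] && cmp -s "$1" "$2"
}
```

Usage would be something like: same_pool_password /tmp/pool_password.cm /opt/app/condor/condor/local.c2403/pool_password && echo "password files match"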
2. IP resolution
IPVERIFY: unable to resolve IP address of c2403
Edit the /etc/hosts file and add the host information to it, one line per host, with the IP address first:
IPAddress hostname
IPAddress hostname
If your server has two network cards, make sure /etc/hosts lists the IP of the interface Condor should use, or set the network interface in the condor_config file.
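To see which IP a hosts file actually maps a name to, something like the following can help. It is only a sketch: hosts_lookup is a hypothetical helper, and it assumes the usual hosts layout where each line is "IPAddress name [alias ...]" and "#" starts a comment.

```shell
# Hypothetical helper: print the IP that a hosts-format file maps a name to.
# Exits non-zero when the name is not listed.
hosts_lookup() {
    # $1: host name to look for, $2: hosts file (defaults to /etc/hosts)
    awk -v name="$1" '
        { sub(/#.*/, "") }                  # drop comments
        { for (i = 2; i <= NF; i++)         # fields after the IP are names
              if ($i == name) { print $1; found = 1; exit } }
        END { exit found ? 0 : 1 }
    ' "${2:-/etc/hosts}"
}
```

For example, hosts_lookup c2403 prints the address that Condor would get for c2403 from /etc/hosts, or prints nothing and fails if the entry is missing.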
3. Using the proper network interface
7/8 11:57:42 Failed to bind to command ReliSock
7/8 11:57:42 (Make sure your IP address is correct in /etc/hosts.)
7/8 11:57:42 ERROR "BindAnyCommandPort() failed" at line 8408 in file daemon_core.C
By default, Condor binds all network sockets to the first interface it finds. To bind to a different interface, set the variable NETWORK_INTERFACE in the local config file to the IP address to use.
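A small sketch of setting this variable from the shell. The helper name set_network_interface is hypothetical, and it assumes the local config file holds simple "NAME = value" lines. It replaces any earlier NETWORK_INTERFACE line so that only one setting remains.

```shell
# Hypothetical helper: set NETWORK_INTERFACE in a Condor local config file.
set_network_interface() {
    # $1: IP address to bind to, $2: path to condor_config.local
    tmp=$(mktemp) || return 1
    # Copy every line except an earlier NETWORK_INTERFACE setting.
    sed '/^[[:space:]]*NETWORK_INTERFACE[[:space:]]*=/d' "$2" > "$tmp"
    # Append the new setting, then write the result back in place.
    printf 'NETWORK_INTERFACE = %s\n' "$1" >> "$tmp"
    cat "$tmp" > "$2" && rm -f "$tmp"
}
```

The daemons most likely need to be restarted before they bind to the new address.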
4. When you use NFS, you need to copy the local directory of the NFS server for each of the other nodes.
cp -rp local.c2403 local.c2401
Bear in mind the option "p": it preserves the mode, ownership, and timestamps of the copied files.
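The copy can be wrapped in a small guard, sketched below. clone_local_dir is a hypothetical name. The guard is there because cp -rp into a directory that already exists would create a nested copy inside it instead of a sibling.

```shell
# Hypothetical helper: copy a Condor local directory for another node,
# preserving mode, ownership and timestamps (cp -p).
clone_local_dir() {
    # $1: existing directory (e.g. local.c2403)
    # $2: directory to create (e.g. local.c2401)
    if [ ! -d "$1" ]; then
        echo "no such directory: $1" >&2
        return 1
    fi
    if [ -e "$2" ]; then
        echo "refusing to overwrite: $2" >&2
        return 1
    fi
    cp -rp "$1" "$2"
}
```

Usage: clone_local_dir local.c2403 local.c2401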
5. Source condor/setup.sh, and then start Condor: condor/sbin/condor_master
6. You can test whether the WN has joined the pool with condor_status. It seems to work best if you first run this command on the central manager, and then on the worker nodes.
7. We migrated the CE from lab1 to lab2 and changed its IP, while the host domain name remained the same. We remapped the IP to the host domain name and ran vdt-control --on. The rsv service failed. We looked at the vdt_install file, which showed:
-- Failed to fetch ads from: <159.226.3.188:55303> : osg.cnic.cn
CEDAR:6001:Failed to connect to <159.226.3.188:55303>
I ran condor_cron_q, and it printed the same message as above.
I corrected the IP and hostname in /etc/hosts, but the problem remained unsolved.
After restarting the server, everything was OK. I am very puzzled that the modification of /etc/hosts only took effect after a reboot.
8. condor_config files
/opt/osg-1.0.1/condor-cron/condor_configure
/opt/osg-1.0.1/condor-cron/etc/condor_config
/opt/osg-1.0.1/condor-cron/sbin/condor_configure
/opt/osg-1.0.1/condor/condor_configure
/opt/osg-1.0.1/condor/etc/condor_config
/opt/osg-1.0.1/condor/sbin/condor_configure
/opt/osg-1.0.1/condor/local.osg/condor_config.local
There are too many of them. I chose the following in my ...
1) vi setup.sh, and specify
. /opt/osg-1.0.1/vdt/etc/condor-env.sh
2) vi vdt/etc/condor-env.sh
PATH=/opt/osg-1.0.1/condor-cron/bin:$PATH
PATH=/opt/osg-1.0.1/condor-cron/sbin:$PATH
export PATH
MANPATH=/opt/osg-1.0.1/condor/man:$MANPATH
export MANPATH
CONDOR_LOCATION=/opt/osg-1.0.1/condor-cron
export CONDOR_LOCATION
CONDOR_CONFIG=/opt/osg-1.0.1/condor-cron/etc/condor_config
export CONDOR_CONFIG
3) vi condor-cron/etc/condor_config
LOCAL_CONFIG_FILE = /opt/osg-1.0.1/condor-cron/local.localhost/condor_config.local
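With this many config files around, it helps to check which one the current shell points at. The following is a rough sketch with a hypothetical function name, check_condor_config. It only looks at the CONDOR_CONFIG environment variable and the LOCAL_CONFIG_FILE line in that file, not at Condor's other search locations.

```shell
# Hypothetical helper: verify that CONDOR_CONFIG names a readable file and
# print the LOCAL_CONFIG_FILE setting(s) found in it.
check_condor_config() {
    if [ -z "${CONDOR_CONFIG:-}" ]; then
        echo "CONDOR_CONFIG is not set" >&2
        return 1
    fi
    if [ ! -r "$CONDOR_CONFIG" ]; then
        echo "cannot read $CONDOR_CONFIG" >&2
        return 1
    fi
    grep '^[[:space:]]*LOCAL_CONFIG_FILE[[:space:]]*=' "$CONDOR_CONFIG"
}
```

Run it after sourcing setup.sh. Condor's own condor_config_val -config should report the files that are actually in use.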
9. Generate the pool password.
The following is excerpted from http://www.cs.wisc.edu/condor/osg_security_recommendations.html:
The version of the condor_store_cred tool distributed with Condor 6.8.5 has an option that allows it to generate a file that can be used for PASSWORD authentication on UNIX Condor installations. To generate the password, run:
$ condor_store_cred -f pool_password
Enter password: <enter password here>
The file pool_password now contains the file needed by Condor to do PASSWORD
Authentication – skip to step 2 below.
10. Install Condor
You can install Condor either with yum or from the Condor tar file.
[root@gsiftp condor-7.4.2]# ./condor_configure --install ./ --owner daemon
Install Condor into the same directory where the tar file was unpacked; otherwise it will not find some of the install files. --owner must be specified.
Don't forget to specify ALLOW_WRITE in the config file.
11. CONDOR_*** variables are not system variables
The following is wrong:
SEC_PASSWORD_FILE = $(LOCAL_DIR)/pool_password //LOCAL_DIR is a Condor variable, not a system variable
12. JavaDetect problem:
11/09 11:13:35 JavaDetect: failure status 256 when executing /usr/bin/java -Xmx1024m1340m -classpath /opt/condor-7.4.2/lib:/opt/condor-7.4.2/lib/scimark2lib.jar:. CondorJavaInfo old 2
Solution:
1)
[root@mpi002 condor-7.4.2]# condor_starter -classad
CondorVersion = "$CondorVersion: 7.4.2 Nov 9 2011 $"
IsDaemonCore = True
HasFileTransfer = True
HasPerFileEncryption = True
HasReconnect = True
HasMPI = True
HasTDP = True
HasJobDeferral = True
HasJICLocalConfig = True
HasJICLocalStdin = True
Invalid maximum heap size: -Xmx512m1340m
Could not create the Java virtual machine.
HasVM = True
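2) Edit the local config file. As the log above shows, Condor appends the heap size itself (the trailing 1340m), so the argument must be the bare option. Change
JAVA_MAXHEAP_ARGUMENT = -Xmx1024
to
JAVA_MAXHEAP_ARGUMENT = -Xmx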
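The failure comes from the JVM receiving an option such as -Xmx1024m1340m, i.e. the configured argument with the computed size stuck on the end. A quick sanity check on the final option can be sketched as below. valid_xmx is a hypothetical helper, and it assumes only digits followed by an optional k, m or g suffix are acceptable.

```shell
# Hypothetical helper: succeed if "$1" looks like a well-formed -Xmx option,
# i.e. -Xmx followed by digits and at most one size suffix.
valid_xmx() {
    printf '%s\n' "$1" | grep -Eq '^-Xmx[0-9]+[kKmMgG]?$'
}
```

With this check, -Xmx1340m passes, while -Xmx1024m1340m and -Xmx512m1340m from the logs above are rejected.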
13. condor_status cannot detect the compute node
Run the following command on the compute node:
tail -f -n 100 /opt/condor-7.4.2/local.mpi002/log/StartLog
11/09 14:31:28 attempt to connect to <192.168.137.2:42303> failed: timed out after 20 seconds.
11/09 14:31:28 ERROR: SECMAN:2003:TCP auth connection to <192.168.137.2:42303> failed.
11/09 14:31:28 Failed to send alive to <192.168.137.2:42303>, will try again...
From another node, connect to the port with telnet:
telnet 192.168.137.2 42303
The connection succeeded, so I concluded that this node had no loopback address, and ran:
ifup lo
This solved the problem.
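When the StartLog is long, it is handy to list just the addresses that timed out before trying them with telnet. A minimal sketch, assuming the log lines have the same "attempt to connect to <ip:port> failed" form as above. failed_endpoints is a hypothetical name.

```shell
# Hypothetical helper: list the unique <ip:port> endpoints that a Condor
# daemon log reports as failed connection attempts.
failed_endpoints() {
    # $1: path to the log file (e.g. .../log/StartLog)
    sed -n 's/.*attempt to connect to <\([^>]*\)> failed.*/\1/p' "$1" | sort -u
}
```

For example: failed_endpoints /opt/condor-7.4.2/local.mpi002/log/StartLog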
14. ClassAd
I wanted to match nodes whose names contain mpi, e.g. mpi001-mpi032.
So I used the following: req = "machine <=\"mpi\"";