How to overcome issues when configuring Condor

 

Condor records error messages in the file local.***/log/MasterLog. Check this file first when something goes wrong.
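A quick way to pull the suspicious lines out of that log is sketched below. The function name condor_log_errors is made up for this post, and the two search words are only the markers seen in the log excerpts that follow, so other failures may be worded differently.

```shell
# Print log lines that look like errors, prefixed with their line numbers.
# Usage: condor_log_errors /path/to/MasterLog
condor_log_errors() {
    # "ERROR" and "Failed" are the markers seen in the excerpts in this post;
    # other daemons or versions may use different wording.
    grep -n -E 'ERROR|Failed' "$1"
}
```

For example: condor_log_errors local.c2401/log/MasterLog | tail -n 20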

 

1. Not binding to the central manager.

 

7/9 02:26:42 Failed to start non-blocking update to <159.226.3.188:9618>.
7/9 02:31:42 ERROR: SECMAN:2004:Failed to start a session to <159.226.3.188:9618> with TCP|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using PASSWORD

 

You need to copy the central manager's file condor/condor/local.c2401/pool_password to the same directory on the worker node. The relevant part of the config file looks like the following:

 

# Security setup to use pool password
SEC_PASSWORD_FILE = /opt/app/condor/condor/local.$(HOSTNAME)/pool_password
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
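Once the central manager's pool_password has been fetched to the worker node (by scp, NFS, or any other means), it can be put in place roughly as follows. This is only a sketch: install_pool_password is a made-up helper, both paths are supplied by the caller, and mode 600 is a cautious choice for a password file rather than something stated above.

```shell
# Install a copy of the central manager's pool_password on a worker node.
# Usage: install_pool_password SRC DEST
# SRC is the file fetched from the central manager; DEST is the path that
# SEC_PASSWORD_FILE points to on this node. Both are chosen by the caller.
install_pool_password() {
    src=$1
    dest=$2
    # Mode 600: only the owner may read or write the password file.
    install -m 600 "$src" "$dest" || return 1
    # Confirm that the copy is byte-for-byte identical to the source.
    cmp -s "$src" "$dest"
}
```

Run it as the user that owns the Condor daemons, then restart Condor on the worker node.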

 

2. IP resolution

 

IPVERIFY: unable to resolve IP address of c2403

 

You can edit the /etc/hosts file and add the host information to it, one line per host, with the IP address first:

IPAddress  hostname

IPAddress  hostname

 

If your server has two network cards, make sure that the IP address in /etc/hosts is the one Condor should use, or set the network interface in the condor_config file.
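To see which address /etc/hosts gives for a node, a small lookup like the one below can help. hosts_lookup is a made-up name, and the sketch only reads the hosts file, so it says nothing about DNS or other name services.

```shell
# Print the IP address that a hosts file maps to a given host name.
# Usage: hosts_lookup NAME [HOSTS_FILE]   (HOSTS_FILE defaults to /etc/hosts)
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        {
            # Field 1 is the address, the remaining fields are names.
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}
```

For example, hosts_lookup c2403 prints the address of the first line that names c2403, or nothing if there is no such line. On a host with two network cards, that first line is the one to check.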

 

3. Using the proper network interface

 

7/8 11:57:42 Failed to bind to command ReliSock
7/8 11:57:42 (Make sure your IP address is correct in /etc/hosts.)
7/8 11:57:42 ERROR "BindAnyCommandPort() failed" at line 8408 in file daemon_core.C

 

Condor binds all network sockets to the first interface found. To bind to a specific network interface other than the first one, set the variable NETWORK_INTERFACE in the local config file to the IP address to use.
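Setting the variable by hand is a one-line edit. If it has to be done on many nodes, something like the following sketch may save typing. set_network_interface is a made-up helper; the config file path and the IP address are whatever applies on the node.

```shell
# Set NETWORK_INTERFACE in a Condor local config file.
# Usage: set_network_interface CONFIG_FILE IP_ADDRESS
set_network_interface() {
    conf=$1
    ip_addr=$2
    tmp=$(mktemp) || return 1
    # Keep every line except an earlier NETWORK_INTERFACE setting.
    # grep exits with 1 when no line is left, which is not an error here.
    grep -v -E '^[[:space:]]*NETWORK_INTERFACE[[:space:]]*=' "$conf" > "$tmp" || true
    printf 'NETWORK_INTERFACE = %s\n' "$ip_addr" >> "$tmp"
    # Write back through cat so the file keeps its owner and mode.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

Restart Condor on the node afterwards so that the daemons bind to the new address.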

 

4. When you use NFS, you need to copy the local directory of the NFS server for each of the other nodes.

 

cp -rp local.c2403 local.c2401

Do not forget the "p" option, which preserves the mode, ownership, and timestamps of the copied files.
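With more than a couple of nodes, the copy can be wrapped in a loop. The sketch below assumes it is run in the directory that holds the local.<host> directories; clone_local_dir is a made-up name.

```shell
# Copy a local.<host> directory once for each additional node.
# Usage: clone_local_dir SRC_HOST NODE [NODE ...]
# Run it in the directory that holds local.SRC_HOST.
clone_local_dir() {
    src_host=$1
    shift
    for node in "$@"; do
        # Skip nodes that already have a directory; cp would otherwise
        # nest the copy inside the existing one.
        if [ -e "local.$node" ]; then
            echo "local.$node already exists, skipping" >&2
            continue
        fi
        # -r copies the whole tree; -p keeps mode, ownership and timestamps.
        cp -rp "local.$src_host" "local.$node" || return 1
    done
}
```

For example: clone_local_dir c2403 c2401 c2402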

 

 

5. Source condor/setup.sh, and then start Condor: condor/sbin/condor_master

 

 

6. You can test whether the WN (worker node) has joined the pool with condor_status. It seems to work best if you first run this command on the central manager, and then on the worker nodes.
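For scripted checks, the test can be reduced to a yes/no answer as sketched below. wn_in_pool is a made-up name. The Machine attribute usually holds the full host name, so the argument may need the domain part, depending on how the pool is set up.

```shell
# Succeed if a machine name appears in the pool as seen by condor_status.
# Usage: wn_in_pool NAME
wn_in_pool() {
    # -format prints the Machine attribute once per slot; sort -u removes
    # the repeats, and grep -x -F requires an exact, literal match.
    condor_status -format '%s\n' Machine | sort -u | grep -q -x -F "$1"
}
```

For example: wn_in_pool c2403 && echo "c2403 is in the pool"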

 

 

7. We migrated the CE from lab1 to lab2 and changed its IP address, while its host domain name remained the same. We remapped the IP address to the host domain name and ran vdt-control --on. The rsv service failed. We looked at the vdt_install file, which showed:

 

-- Failed to fetch ads from: <159.226.3.188:55303> : osg.cnic.cn
CEDAR:6001:Failed to connect to <159.226.3.188:55303>

I ran condor_cron_q, and it printed the same message as above.

 

I modified /etc/hosts with the right IP address and hostname, but the problem remained unsolved.

After restarting the server, everything was OK. I am puzzled that the modification of /etc/hosts took effect only after a reboot.
 

 

8. condor_config files

 

/opt/osg-1.0.1/condor-cron/condor_configure

/opt/osg-1.0.1/condor-cron/etc/condor_config

/opt/osg-1.0.1/condor-cron/sbin/condor_configure

/opt/osg-1.0.1/condor/condor_configure

/opt/osg-1.0.1/condor/etc/condor_config

/opt/osg-1.0.1/condor/sbin/condor_configure

/opt/osg-1.0.1/condor/local.osg/condor_config.local

 

There are too many of them. I chose the following in my ...

 

1) vi setup.sh, and specify

 

. /opt/osg-1.0.1/vdt/etc/condor-env.sh

 

2) vi vdt/etc/condor-env.sh

 

PATH=/opt/osg-1.0.1/condor-cron/bin:$PATH
PATH=/opt/osg-1.0.1/condor-cron/sbin:$PATH
export PATH

MANPATH=/opt/osg-1.0.1/condor/man:$MANPATH
export MANPATH

CONDOR_LOCATION=/opt/osg-1.0.1/condor-cron
export CONDOR_LOCATION

CONDOR_CONFIG=/opt/osg-1.0.1/condor-cron/etc/condor_config
export CONDOR_CONFIG

 

 

3) vi condor-cron/etc/condor_config

 

LOCAL_CONFIG_FILE = /opt/osg-1.0.1/condor-cron/local.localhost/condor_config.local

 

 

9. Generate the pool password.

 

 

The following is excerpted from http://www.cs.wisc.edu/condor/osg_security_recommendations.html:

 The version of the condor_store_cred tool distributed with Condor 6.8.5 has an option that allows it to generate a file that can be used for PASSWORD authentication on UNIX Condor installations. To generate the password, run:
 $ condor_store_cred -f pool_password
Enter password: <enter password here>

 The file pool_password now contains the file needed by Condor to do PASSWORD authentication. Skip to step 2 below (step 2 of the linked page).

 

10. Install Condor

You can install Condor either with yum or from the Condor tar file.

 

[root@gsiftp condor-7.4.2]# ./condor_configure --install ./  --owner daemon

Install Condor into the same directory where the tar file was unpacked; otherwise the installer will not find some of the install files. The --owner option must be specified.

 

Don't forget to specify ALLOW_WRITE in the config file.

 

11. A CONDOR_*** variable is not a system variable

The following is wrong, because LOCAL_DIR is a Condor variable, not a system variable:

SEC_PASSWORD_FILE = $(LOCAL_DIR)/pool_password

 

12. JavaDetect problem:

11/09 11:13:35 JavaDetect: failure status 256 when executing /usr/bin/java -Xmx1024m1340m -classpath /opt/condor-7.4.2/lib:/opt/condor-7.4.2/lib/scimark2lib.jar:. CondorJavaInfo old 2

Solution:
1)
[root@mpi002 condor-7.4.2]#  condor_starter -classad
CondorVersion = "$CondorVersion: 7.4.2 Nov  9 2011 $"
IsDaemonCore = True
HasFileTransfer = True
HasPerFileEncryption = True
HasReconnect = True
HasMPI = True
HasTDP = True
HasJobDeferral = True
HasJICLocalConfig = True
HasJICLocalStdin = True
Invalid maximum heap size: -Xmx512m1340m
Could not create the Java virtual machine.
HasVM = True

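2) Modify the local config file. Judging from the messages above, Condor appends the heap size (here 1340m) to this argument, so a value that already contains a size produces an invalid option such as -Xmx1024m1340m. Change
JAVA_MAXHEAP_ARGUMENT = -Xmx1024
to
JAVA_MAXHEAP_ARGUMENT = -Xmx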
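To check several nodes for this setting, a small test like the one below may help. maxheap_is_bare is a made-up name, and the sketch only looks at the one config file it is given, so a setting that comes from another file is not seen.

```shell
# Succeed if the last JAVA_MAXHEAP_ARGUMENT setting in a config file is
# the bare flag "-Xmx", with no size attached.
# A file without the setting counts as "not bare".
# Usage: maxheap_is_bare CONFIG_FILE
maxheap_is_bare() {
    # Print the value part of every matching line and keep the last one.
    value=$(sed -n -E 's/^[[:space:]]*JAVA_MAXHEAP_ARGUMENT[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1)
    # Trim trailing blanks before comparing.
    value=$(printf '%s' "$value" | sed -E 's/[[:space:]]+$//')
    [ "$value" = "-Xmx" ]
}
```

For example: maxheap_is_bare /opt/condor-7.4.2/local.mpi002/condor_config.local || echo "fix JAVA_MAXHEAP_ARGUMENT" (the file name here is a guess based on the paths in this post).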

13. condor_status cannot detect the compute node

Run the following command on the compute node:

tail -f -n 100  /opt/condor-7.4.2/local.mpi002/log/StartLog

11/09 14:31:28 attempt to connect to <192.168.137.2:42303> failed: timed out after 20 seconds.
11/09 14:31:28 ERROR: SECMAN:2003:TCP auth connection to <192.168.137.2:42303> failed.
11/09 14:31:28 Failed to send alive to <192.168.137.2:42303>, will try again...

From another node, connect to the port with telnet:

telnet 192.168.137.2  42303

The connection succeeded, so I concluded that this node had no loopback address, and ran:

ifup lo

This solved the problem.
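Whether the loopback interface is up can also be checked directly before digging through the logs. The sketch below assumes the ip command from iproute2 is installed; lo_is_up is a made-up name.

```shell
# Succeed if the loopback interface lo is up.
lo_is_up() {
    # An interface that is up lists UP among its flags,
    # for example <LOOPBACK,UP,LOWER_UP>. LOWER_UP alone does not match,
    # because UP must directly follow "<" or ",".
    ip link show lo | grep -q -E '[<,]UP[,>]'
}
```

For example: lo_is_up || ifup lo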

14. ClassAd

I wanted to match nodes whose names contain mpi, such as mpi001-mpi032.

So I used the following: req = "machine <=\"mpi\"";

 
