How to overcome the issues of configuring condor

zhxue123

Category: Grid. Tags: authentication, file, interface, network, methods, security

Article link: https://blog.csdn.net/zhxue123/article/details/4333667

Condor records error messages in the file local.***/log/MasterLog. Check this file first when something goes wrong.
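
As a starting point for reading that log, the sketch below filters a daemon log for the kinds of lines quoted in this article. The function name condor_log_errors and the example path are hypothetical, and the pattern (lines containing ERROR or Failed) is only an assumption about what is worth reading first.

```shell
# Print the lines of a Condor daemon log that look like errors.
# Assumption: interesting lines contain "ERROR" or "Failed", as in the
# log excerpts quoted in this article; adjust the pattern as needed.
condor_log_errors() {
    log_file=$1
    grep -E 'ERROR|Failed' "$log_file"
}

# Example (hypothetical path), showing only the most recent matches:
#   condor_log_errors /opt/app/condor/condor/local.c2401/log/MasterLog | tail -n 20
```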

1. The worker node does not bind to the central manager.

7/9 02:26:42 Failed to start non-blocking update to <159.226.3.188:9618>.
7/9 02:31:42 ERROR: SECMAN:2004:Failed to start a session to <159.226.3.188:9618> with TCP|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using PASSWORD

You need to copy the central manager's file condor/condor/local.c2401/pool_password to the same directory on the worker node. The configuration related to this issue looks like the following:

# Security setup to use pool password
SEC_PASSWORD_FILE = /opt/app/condor/condor/local.$(HOSTNAME)/pool_password
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = PASSWORD
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = PASSWORD
SEC_CLIENT_AUTHENTICATION_METHODS = FS, PASSWORD
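
PASSWORD authentication can only succeed if both machines hold the same password file, so it can help to confirm that the copy on the worker node really matches the one on the central manager. The following is a minimal sketch; the function name same_pool_password and the paths in the usage comment are hypothetical.

```shell
# Succeed only if two copies of the pool password file are byte-identical.
# cmp -s prints nothing and reports the result through its exit status.
same_pool_password() {
    cmp -s "$1" "$2"
}

# Example (hypothetical paths): compare a copy fetched from the central
# manager with the file this node is configured to use.
#   same_pool_password /tmp/pool_password.from_manager \
#       "$(condor_config_val SEC_PASSWORD_FILE)" || echo "password files differ"
```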

2. IP resolution

IPVERIFY: unable to resolve IP address of c2403

You can edit the /etc/hosts file and add the host information to it, with the IP address first:

IPAddress hostname

If your server has two network cards, make sure the IP address in /etc/hosts is the one Condor should use, or set the network interface explicitly in the condor_config file.
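
To check what a hosts file actually says about a node, a small lookup helper can be useful. This is a sketch only: hosts_lookup is a hypothetical name, and it reads the file directly rather than going through the system resolver, so it shows the file's contents and not necessarily what Condor resolves.

```shell
# Print the IP address that a hosts file assigns to a host name.
# Usage: hosts_lookup NAME [FILE]   (FILE defaults to /etc/hosts)
hosts_lookup() {
    lookup_name=$1
    hosts_file=${2:-/etc/hosts}
    awk -v name="$lookup_name" '
        { sub(/#.*/, "") }            # drop comments
        NF >= 2 {                     # field 1 is the IP, the rest are names
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$hosts_file"
}

# Example: hosts_lookup c2403
```

An empty result means the name is not listed in the file at all.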

3. Using the proper network interface

7/8 11:57:42 Failed to bind to command ReliSock
7/8 11:57:42 (Make sure your IP address is correct in /etc/hosts.)
7/8 11:57:42 ERROR "BindAnyCommandPort() failed" at line 8408 in file daemon_core.C

Condor binds all network sockets to the first interface it finds. To bind to a specific network interface other than the first one, set the variable NETWORK_INTERFACE in the local config file to the IP address to use.
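
One way to apply that setting is sketched below. The helper name set_network_interface, the address 192.0.2.10 and the config path in the usage comment are all hypothetical; the function simply rewrites a config file so that it contains exactly one NETWORK_INTERFACE line.

```shell
# Set NETWORK_INTERFACE in a Condor local config file, replacing any
# existing assignment so the variable is defined exactly once.
# Usage: set_network_interface IP_ADDRESS CONFIG_FILE
set_network_interface() {
    ip_addr=$1
    conf=$2
    [ -f "$conf" ] || : > "$conf"
    tmp_conf=$(mktemp) || return 1
    # Keep every line except a previous NETWORK_INTERFACE assignment.
    grep -v -E '^[[:space:]]*NETWORK_INTERFACE[[:space:]]*=' "$conf" > "$tmp_conf" || true
    printf 'NETWORK_INTERFACE = %s\n' "$ip_addr" >> "$tmp_conf"
    # Write back through the original file to keep its owner and mode.
    cat "$tmp_conf" > "$conf" && rm -f "$tmp_conf"
}

# Example (hypothetical address and path):
#   set_network_interface 192.0.2.10 /opt/app/condor/condor/local.c2403/condor_config.local
```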

4. When you use NFS, you need to copy the local directory of the NFS server for each of the other nodes.

cp -rp local.c2403 local.c2401

Do not omit the "p" option: it preserves the ownership, permissions and timestamps of the copied files.
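
A slightly more defensive version of that copy is sketched here. The wrapper name clone_local_dir is hypothetical; the only addition over the plain command is a guard, because cp -rp into a directory that already exists would create a nested copy inside it instead of a sibling directory.

```shell
# Copy a Condor local directory for another node, preserving ownership,
# permissions and timestamps (-p) and recursing into subdirectories (-r).
# Usage: clone_local_dir SOURCE_DIR NEW_DIR
clone_local_dir() {
    src_dir=$1
    dst_dir=$2
    if [ -e "$dst_dir" ]; then
        echo "refusing to overwrite existing $dst_dir" >&2
        return 1
    fi
    cp -rp "$src_dir" "$dst_dir"
}

# Example: clone_local_dir local.c2403 local.c2401
```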

5. Source condor/setup.sh, and then start Condor by running condor/sbin/condor_master.

6. You can check whether the WN (worker node) has joined the pool with condor_status. It seems to work best to run this command on the central manager first, and then on the worker nodes.
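
For scripted checks, the condor_status output can be reduced to a list of machine names and searched. The sketch below assumes that condor_status -format '%s\n' Machine prints one machine name per line; the helper name pool_has_machine and the host name in the usage comment are hypothetical.

```shell
# Read machine names on standard input, one per line, and succeed if the
# given name appears as an exact, whole-line match.
pool_has_machine() {
    grep -q -x -F -- "$1"
}

# Example (hypothetical host name):
#   condor_status -format '%s\n' Machine | pool_has_machine c2403.example.org \
#       && echo "worker node is in the pool"
```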

7. We migrated the CE from lab1 to lab2 and changed its IP address, while the host domain name stayed the same. We remapped the IP address to the host domain name and ran vdt-control --on, but the rsv service failed. Looking in the vdt_install file, we found:

-- Failed to fetch ads from: <159.226.3.188:55303> : osg.cnic.cn
CEDAR:6001:Failed to connect to <159.226.3.188:55303>

Running condor_cron_q gave the same message as above.

I corrected the IP address and hostname in /etc/hosts, but the problem remained.

After restarting the server, everything worked. I am puzzled that the modification of /etc/hosts only took effect after a reboot.

8. condor_config files

/opt/osg-1.0.1/condor-cron/condor_configure

/opt/osg-1.0.1/condor-cron/etc/condor_config

/opt/osg-1.0.1/condor-cron/sbin/condor_configure

/opt/osg-1.0.1/condor/condor_configure

/opt/osg-1.0.1/condor/etc/condor_config

/opt/osg-1.0.1/condor/sbin/condor_configure

/opt/osg-1.0.1/condor/local.osg/condor_config.local

That is too many. I chose the following in my ...

1) Edit setup.sh (vi setup.sh) so that it sources:

. /opt/osg-1.0.1/vdt/etc/condor-env.sh

2) vi vdt/etc/condor-env.sh

PATH=/opt/osg-1.0.1/condor-cron/bin:$PATH
PATH=/opt/osg-1.0.1/condor-cron/sbin:$PATH
export PATH

MANPATH=/opt/osg-1.0.1/condor/man:$MANPATH
export MANPATH

CONDOR_LOCATION=/opt/osg-1.0.1/condor-cron
export CONDOR_LOCATION

CONDOR_CONFIG=/opt/osg-1.0.1/condor-cron/etc/condor_config
export CONDOR_CONFIG

3) vi condor-cron/etc/condor_config

LOCAL_CONFIG_FILE = /opt/osg-1.0.1/condor-cron/local.localhost/condor_config.local
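
With this many candidate files, it is worth confirming which one the environment actually points at after sourcing setup.sh. The following is a minimal sketch with a hypothetical function name, check_condor_env; it only inspects the CONDOR_CONFIG environment variable and does not query Condor itself.

```shell
# Report which condor_config the current environment selects.
# Fails if CONDOR_CONFIG is unset or does not point to a readable file.
check_condor_env() {
    if [ -z "${CONDOR_CONFIG:-}" ]; then
        echo "CONDOR_CONFIG is not set" >&2
        return 1
    fi
    if [ ! -r "$CONDOR_CONFIG" ]; then
        echo "cannot read $CONDOR_CONFIG" >&2
        return 1
    fi
    echo "using $CONDOR_CONFIG"
}

# Example:
#   . /opt/osg-1.0.1/vdt/etc/condor-env.sh && check_condor_env
# condor_config_val -config can then be used to ask Condor which files it read.
```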

9. Generate the pool password.

The following is excerpted from http://www.cs.wisc.edu/condor/osg_security_recommendations.html:

The version of the condor_store_cred tool distributed with Condor 6.8.5 has an option that allows it to generate a file that can be used for PASSWORD authentication on UNIX Condor installations. To generate the password, run:

$ condor_store_cred -f pool_password

Enter password: <enter password here>

The file pool_password now contains the file needed by Condor to do PASSWORD authentication; skip to step 2 below.

10. Install condor

You can install Condor either with yum or from the Condor tar file.

[root@gsiftp condor-7.4.2]# ./condor_configure --install ./ --owner daemon

Install Condor into the same directory where the tar file was unpacked; otherwise the installer will not find some of its files. The --owner option must be specified.

Don't forget to specify ALLOW_WRITE in the config file.

11. A CONDOR_*** variable is not a system variable

The following is wrong:

SEC_PASSWORD_FILE = $(LOCAL_DIR)/pool_password

LOCAL_DIR is a Condor variable, not a system variable.

12. JavaDetect problem:

11/09 11:13:35 JavaDetect: failure status 256 when executing /usr/bin/java -Xmx1024m1340m -classpath /opt/condor-7.4.2/lib:/opt/condor-7.4.2/lib/scimark2lib.jar:. CondorJavaInfo old 2

Solution:
1) Run condor_starter -classad to see the underlying error:
[root@mpi002 condor-7.4.2]# condor_starter -classad
CondorVersion = "$CondorVersion: 7.4.2 Nov 9 2011 $"
IsDaemonCore = True
HasFileTransfer = True
HasPerFileEncryption = True
HasReconnect = True
HasMPI = True
HasTDP = True
HasJobDeferral = True
HasJICLocalConfig = True
HasJICLocalStdin = True
Invalid maximum heap size: -Xmx512m1340m
Could not create the Java virtual machine.
HasVM = True

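2) Edit the local config file, changing
JAVA_MAXHEAP_ARGUMENT = -Xmx1024
to
JAVA_MAXHEAP_ARGUMENT = -Xmx
Judging from the log, Condor appends the heap size itself, so the argument should contain only the switch name.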
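
The failure is easier to see when the concatenation is written out. The sketch below assumes, based on the log lines above, that Condor builds the JVM option by appending a size in megabytes plus "m" to JAVA_MAXHEAP_ARGUMENT; both function names are hypothetical, and the validity check is a simplified approximation of what the JVM accepts.

```shell
# Build the heap option the way the log suggests Condor does:
# configured prefix + size in megabytes + "m".
build_heap_flag() {
    printf '%s%sm\n' "$1" "$2"
}

# Simplified check: -Xmx, digits, then at most one unit letter.
valid_heap_flag() {
    printf '%s\n' "$1" | grep -E -q '^-Xmx[0-9]+[kKmMgG]?$'
}

# Examples:
#   build_heap_flag -Xmx 1340        gives -Xmx1340m       (accepted)
#   build_heap_flag -Xmx1024m 1340   gives -Xmx1024m1340m  (rejected, as in the log)
```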

13. condor_status does not detect the compute node

Run the following command on the compute node:

tail -f -n 100 /opt/condor-7.4.2/local.mpi002/log/StartLog

11/09 14:31:28 attempt to connect to <192.168.137.2:42303> failed: timed out after 20 seconds.
11/09 14:31:28 ERROR: SECMAN:2003:TCP auth connection to <192.168.137.2:42303> failed.
11/09 14:31:28 Failed to send alive to <192.168.137.2:42303>, will try again...

From another node, connect to the port with telnet:

telnet 192.168.137.2 42303

The connection succeeded, so I concluded that this node had no loopback address and ran:

ifup lo

That solved the problem.
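
Before running ifup lo, the state of the loopback interface can be checked. The sketch below assumes the iproute2 ip command is available and that its output lists interface flags in angle brackets, for example <LOOPBACK,UP,LOWER_UP>; the helper name loopback_is_up is hypothetical.

```shell
# Read the output of "ip link show lo" on standard input and succeed if
# the flag list in angle brackets contains UP as a separate flag.
loopback_is_up() {
    grep -E -q '<([^>]*,)?UP(,[^>]*)?>'
}

# Example:
#   ip link show lo | loopback_is_up || echo "loopback is down; try: ifup lo"
```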

14. ClassAd

I wanted to match nodes whose names contain mpi, such as mpi001 through mpi032.

So I used the following: req = "machine <=\"mpi\"";
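
Note that "name contains mpi" is a pattern match, whereas <= is an ordering comparison, so the expression above may not select the intended machines. The sketch below expresses the intended test in shell purely for illustration; it is not ClassAd syntax, and the function name is_mpi_node is hypothetical. The ClassAd documentation describes a regexp() function that could play the same role in a requirements expression, but that has not been verified against the Condor version used here.

```shell
# Succeed if a host name looks like mpiNNN (three digits), with or
# without a domain suffix. It does not check the 001-032 range.
is_mpi_node() {
    case $1 in
        mpi[0-9][0-9][0-9] | mpi[0-9][0-9][0-9].*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: is_mpi_node mpi017 && echo "matches"
```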
