(This article is updated from time to time.)
Contents
1. Tenant issues
2. Execution-time environment variables for Python tasks that use a dedicated virtual Python environment
3. Failure to create directories or files in the Resource Center
4. Shutdowns caused by ZooKeeper synchronization
1. Tenant issues
In my environment, everyone logs into the jump server with their own account and then works under a single shared Linux account, so there is no switching between multiple Linux users (i.e., DolphinScheduler's multi-tenancy). The shared account itself has no root privileges; root access must be requested from the ops team and is only granted for a limited time. So at deployment I did not use a root account at all and deployed directly under the shared account. Then, after downloading the source code, I commented out or rewrote the commands involving "sudo -u" so that everything runs normally.
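The kind of rewrite described above can be sketched roughly as follows. The article does not name the exact files that were changed, so treat this as an illustration of the change applied to any shell-script text, not the actual patch:

```python
import re

def neutralize_sudo(script_text: str) -> str:
    """Comment out any line that switches users via `sudo -u` (tenant switching)."""
    out = []
    for line in script_text.splitlines():
        if re.search(r'\bsudo\s+-u\b', line):
            out.append('# ' + line)  # disabled: no root, single shared account
        else:
            out.append(line)
    return '\n'.join(out)

print(neutralize_sudo('sudo -u tenant sh bin/run.sh\necho done'))
```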
2. Execution-time environment variables for Python tasks that use a dedicated virtual Python environment
Some special tasks need a virtual Python environment, so I put them in their own worker group and pointed PYTHON_HOME in dolphinscheduler-env.sh at that environment. But at execution time an error appeared:
JAVA_HOME could not be found. This error is raised by pyflink when it creates the execution environment:
s_env = StreamExecutionEnvironment.get_execution_environment()
Comment out that line and add the following:
import os
print('environment variables:', os.system('env'))
This showed that JAVA_HOME was empty for that run:
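The same check can be done without shelling out to `env`, by reading the inherited environment directly (just a debugging sketch):

```python
import os

# os.environ reflects the environment the task process actually inherited;
# a missing or empty JAVA_HOME here reproduces the pyflink failure above.
java_home = os.environ.get('JAVA_HOME', '')
print('JAVA_HOME =', repr(java_home))
```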
Yet JAVA_HOME is clearly set in dolphinscheduler-env.sh, and restarting the worker-server did not help.
Final fix: set the environment variable for that user (changing the global environment is risky; changing only this user's is enough):
vim ~/.bashrc
export JAVA_HOME=/usr
export PATH=$JAVA_HOME/bin:$PATH
Since `which java` finds the java executable under /usr/bin, setting JAVA_HOME to /usr is sufficient.
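The reasoning: if `which java` resolves to /usr/bin/java, then JAVA_HOME=/usr works because $JAVA_HOME/bin/java points back at the same binary. A quick check of that derivation (the path is the one observed on this host):

```python
import os

java_path = '/usr/bin/java'  # what `which java` returned here
# Strip the trailing /bin/java to get the JAVA_HOME candidate.
java_home = os.path.dirname(os.path.dirname(java_path))
print(java_home)  # -> /usr
```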
3. Failure to create directories or files in the Resource Center
The api-server reported the error; from its logs:
### The error occurred while setting parameters
### SQL: INSERT INTO t_ds_resources ( file_name, size, create_time, description, full_name, alias, update_time, pid, type, user_id, is_directory ) VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )
### Cause: org.postgresql.util.PSQLException: ERROR: column "is_directory" is of type integer but expression is of type boolean
Hint: You will need to rewrite or cast the expression.
Position: 192
; bad SQL grammar []; nested exception is org.postgresql.util.PSQLException: ERROR: column "is_directory" is of type integer but expression is of type boolean
Hint: You will need to rewrite or cast the expression.
Position: 192
at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:239)
at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:72)
at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:74)
at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:440)
at com.sun.proxy.$Proxy93.insert(Unknown Source)
at org.mybatis.spring.SqlSessionTemplate.insert(SqlSessionTemplate.java:271)
at com.baomidou.mybatisplus.core.override.MybatisMapperMethod.execute(MybatisMapperMethod.java:58)
at com.baomidou.mybatisplus.core.override.MybatisMapperProxy.invoke(MybatisMapperProxy.java:61)
at com.sun.proxy.$Proxy104.insert(Unknown Source)
at org.apache.dolphinscheduler.api.service.ResourcesService.createDirectory(ResourcesService.java:130)
I checked my PostgreSQL cast settings with the \dC command (partial output copied):
postgres=# \dC
List of casts
Source type | Target type | Function | Implicit?
-----------------------------+-----------------------------+--------------------+---------------
bigint | regoperator | oid | yes
bigint | regproc | oid | yes
bigint | regprocedure | oid | yes
bigint | regrole | oid | yes
bigint | regtype | oid | yes
bigint | smallint | int2 | in assignment
bit | bigint | int8 | no
bit | bit | bit | yes
bit | bit varying | (binary coercible) | yes
bit | integer | int4 | no
bit varying | bit | (binary coercible) | yes
bit varying | bit varying | varbit | yes
boolean | character | text | in assignment
boolean | character varying | text | in assignment
boolean | integer | int4 | no
boolean | text | text | in assignment
integer | abstime | (binary coercible) | no
integer | bigint | int8 | yes
integer | bit | bit | no
integer | boolean | bool | no
So integer and boolean cannot be implicitly converted to each other.
By the rules, castcontext='e' shows as "no", 'a' as "in assignment" (the cast is applied only in assignments), and anything else shows as "yes" (the cast is applied in both assignments and expressions).
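That display rule from psql's \dC, as a small lookup (note that besides 'e' and 'a', the only code PostgreSQL itself defines is 'i', for implicit):

```python
# How \dC renders pg_cast.castcontext in its "Implicit?" column.
def implicit_label(castcontext: str) -> str:
    if castcontext == 'e':
        return 'no'             # explicit casts only
    if castcontext == 'a':
        return 'in assignment'  # implicit in assignments only
    return 'yes'                # 'i' (or any other value) displays as implicit

print(implicit_label('i'), '/', implicit_label('a'), '/', implicit_label('e'))
```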
Since the DolphinScheduler error complained about an expression type, I tried to turn this into "yes" by assigning some value other than 'e' and 'a'; to be self-explanatory, I chose 'y'. (In hindsight, PostgreSQL's catalog only defines 'e', 'a', and 'i', with 'i' meaning implicit; an unrecognized code like 'y' likely only changes what \dC displays without making the server apply the cast, which would explain why the error persisted below.)
update pg_cast set castcontext='y' where (castsource ='integer'::regtype and casttarget='boolean'::regtype) or (castsource ='boolean'::regtype and casttarget='integer'::regtype);
UPDATE 2
postgres=# \dC
List of casts
Source type | Target type | Function | Implicit?
-----------------------------+-----------------------------+--------------------+---------------
boolean | integer | int4 | yes
integer | boolean | bool | yes
But the error persisted, so I considered changing the column type instead:
ALTER TABLE t_ds_resources ALTER COLUMN is_directory TYPE boolean;
psql responded:
dolphinscheduler=# ALTER TABLE t_ds_resources ALTER COLUMN is_directory TYPE boolean;
ERROR: column "is_directory" cannot be cast automatically to type boolean
HINT: You might need to specify "USING is_directory::boolean".
Adjusted command:
dolphinscheduler=# ALTER TABLE t_ds_resources ALTER COLUMN is_directory TYPE boolean USING is_directory::boolean;
ALTER TABLE
After this, the Resource Center works normally.
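What the USING clause does to the existing rows, mirrored in Python: PostgreSQL's integer-to-boolean cast maps 0 to false and any non-zero value to true (a simplified illustration of the conversion, not the server's code path):

```python
# Existing integer values in is_directory and their post-ALTER boolean values.
rows = [0, 1, 1, 0]
converted = [bool(v) for v in rows]  # int::boolean: 0 -> false, non-zero -> true
print(converted)  # -> [False, True, True, False]
```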
4. Shutdowns caused by ZooKeeper synchronization
[INFO] 2021-12-11 06:41:13.049 org.apache.dolphinscheduler.server.master.registry.MasterRegistry:[92] - master : {my_ip:port} reconnected to zookeeper
[INFO] 2021-12-11 06:41:13.051 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[258] - master node : /dolphinscheduler/nodes/master/{my_ip:port} down.
[INFO] 2021-12-11 06:41:13.056 org.apache.dolphinscheduler.service.zk.ZookeeperOperator:[79] - reconnected to zookeeper
[INFO] 2021-12-11 06:41:13.077 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[254] - master node : /dolphinscheduler/nodes/master/{my_ip:port} added.
[INFO] 2021-12-11 06:41:14.168 org.apache.dolphinscheduler.service.log.LogClientService:[100] - view log path /home/log_proxy/dolphinscheduler/installation/logs/248/14898/27097.log
[INFO] 2021-12-11 06:41:16.284 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[221] - worker group node : /dolphinscheduler/nodes/worker/default/{my_ip:port} down.
[INFO] 2021-12-11 06:41:16.308 org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[221] - worker group node : /dolphinscheduler/nodes/worker/spark2/{my_ip:port} down.
2021-12-07 19:03:49,243 [myid:7] - WARN [SyncThread:7:FileTxnLog@408] - fsync-ing the write ahead log in SyncThread:7 took 2517ms which will adversely effect operation latency.File size is 67108880 bytes. See the ZooKeeper troubleshooting guide
2021-12-03 19:51:52,437 [myid:7] - WARN [NIOWorkerThread-19:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /{my_ip:port}, session = 0x7139f84ec880024
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-12-03 19:51:52,437 [myid:7] - WARN [NIOWorkerThread-18:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /{my_ip:port}, session = 0x7139f84ec880026
The key line is this one:
fsync-ing the write ahead log in SyncThread:7 took 2517ms which will adversely effect operation latency.File size is 67108880 bytes. See the ZooKeeper troubleshooting guide
It means that fsync-ing ZooKeeper's write-ahead (transaction) log took too long, because the log file being synced is large; the resulting latency makes synchronization between ZooKeeper nodes lag, sessions time out, and the DolphinScheduler node registrations are lost. So the idea: make each synced data file smaller and tune the synchronization timing, so the pressure drops and synchronization no longer fails and loses the DolphinScheduler node information.
Edit zoo.cfg under ZooKeeper's conf directory:
# The number of milliseconds of each tick
tickTime=8000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=XXXX/data
# transaction log directory (create it yourself)
dataLogDir=XXXX/log
# the port at which the clients will connect
clientPort=xxxx
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=256
maxCnxns=256
# Transaction-log preallocation size, in KB; too large slows leader-follower sync.
preAllocSize=5120
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
autopurge.snapRetainCount=7
# Purge task interval in hours
# Set to "0" to disable auto purge feature
autopurge.purgeInterval=6
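The timing these settings imply can be sanity-checked with a little arithmetic (using the values above; the defaults differ, e.g. tickTime=2000 and a 64 MB preAllocSize):

```python
# Sync windows implied by the zoo.cfg values above.
tick_time_ms = 8000  # tickTime
init_limit = 10      # initLimit, in ticks
sync_limit = 5       # syncLimit, in ticks

init_window_s = tick_time_ms * init_limit / 1000  # follower initial-sync window
sync_window_s = tick_time_ms * sync_limit / 1000  # follower request/ack window
print(init_window_s, sync_window_s)  # -> 80.0 40.0

# preAllocSize is in KB: 5120 KB = 5 MB per preallocated txn-log chunk,
# versus the 67108880-byte (64 MB) file mentioned in the fsync warning.
prealloc_bytes = 5120 * 1024
print(prealloc_bytes)  # -> 5242880
```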
This clearly resolved the problem. Set the exact values according to your own environment; the above is for reference only.