Notes on things that took some time to figure out at work

Topics listed by functional area

About DBeaver

  1. Apart from relational databases, where Navicat is the tool of choice, DBeaver is recommended for all NoSQL stores: Redis, Elasticsearch, MongoDB, and so on.
  2. Here is mainly how to run MongoDB commands in it.
	A MongoDB "like" query:
	db.checkHistory.find({rowKey:/^T_HDS_GXJS_JZGGZJL/}).toArray()
	//.limit(3).toArray()
	//db.checkHistory.deleteMany({rowKey:/^T_HDS_GXJG_JZGJCSJZLB/})
	Delete with a condition:
db.manualHandle.deleteMany({tableUniqueCode:'MYSQL_10.0.x.10_3306_nullDBDatabase^HDS_Basics|DBTable^test2019v2'})

Spark notes

 When reading rows out of an RDD, fetch field values in the form row.getAs[Object](fieldEnName); that is what avoids type errors (see the sketch after this list)
 show tblproperties targettable;  -- view a table's properties (e.g. row-count stats)
 Compute table row-count statistics:
 ANALYZE TABLE <table_name> COMPUTE STATISTICS;
 Join all the jars under a directory:
  --jars $(echo /home/rowen/libs/*.jar | tr ' ' ',')
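
A minimal sketch of the getAs pattern mentioned above, assuming a DataFrame df (e.g. the one built in the spark-shell section below); fieldEnName is a hypothetical variable holding a column name:

val fieldEnName = "topic"
df.rdd.foreach { row =>
  // getAs[Object] defers the type decision, so a mismatched column type does not throw
  val v = row.getAs[Object](fieldEnName)
  println(v)
}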

- spark debugging:
--conf spark.driver.extraJavaOptions="-Dorg.slf4j.simpleLogger.defaultLogLevel=trace -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=41999"
- Specifying a compression format:
df.write.mode(SaveMode.Overwrite).option("compression", "gzip").csv(s"${path}")
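
Reading it back needs nothing special; Spark infers the gzip codec from the .gz file extension (a minimal sketch, reusing the same path):

val readBack = spark.read.csv(s"${path}")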

- Connecting Spark to Transwarp Inceptor:
https://nj.transwarp.cn:8180/?p=3382
If you access it over JDBC, delete these three jars from the jars directory:
spark-hive_2.11-2.3.2.jar
spark-hive-thriftserver_2.11-2.3.2.jar
hive-jdbc-1.2.1.spark2.jar
If you are not using JDBC, keep those three jars and simply run
./spark-shell, then spark.sql("SELECT * FROM default.test_orc").show()

- Getting an interactive shell inside a container (via kubectl):
  kubectl get pods | grep kafka
  kubectl exec -it kafka-server-kafka1-58d9ccdd75-5wqzt bash

- Huawei MRS environment
Replacing the archive package:
cd $SPARK_HOME
grep "yarn.archive" conf/spark-defaults.conf
source  ../../bigdata_env
hdfs dfs -get '<the config path found by grep>'
cp  spark-archive-2x.zip  spark-archive-2x_kafka.zip
unzip -l spark-archive-2x.zip |grep kafka    # if nothing shows up, the kafka jar has to be added
ll jars/streamingClient010/spark-sql-kafka*  # locate the kafka jar
zip -u spark-archive-2x_kafka.zip spark-sql-kafka.jar    # add the kafka jar to the copied archive
hdfs dfs -put spark-archive-2x_kafka.zip '<the archive directory>'
In conf/spark-defaults.conf, append the _kafka suffix to the file name configured for yarn.archive,
then rerun the job.

- Spark 3 tuning
Set spark.sql.adaptive.enabled to true to turn on AQE (in Spark 3.0 it defaults to false). AQE only kicks in when the query:
is not a streaming query, and
contains at least one exchange (e.g. from a join, aggregation, or window operator) or a subquery.
By reducing the reliance on static statistics, AQE resolves a trade-off that Spark's CBO could never handle well (the cost of generating statistics versus query latency) along with its accuracy problems; compared with the previously limited CBO it is far more flexible.
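
A minimal spark-shell sketch for switching it on; the two extra options are the standard AQE knobs for coalescing small shuffle partitions and handling skewed joins:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")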

- Creating a custom function in Hive
create temporary function ods.customcheck_date_format as 'com.sefonsoft.dataquality.func.check.check_date_format'
 create function ckdate as 'com.soft.quality.func.check.check_date_format' USING JAR 'hdfs://sxx8142.:8020/user/hive/hiveUDF.jar';
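
A hypothetical invocation from spark-shell (assuming Hive support is enabled and the UDF validates a date string):

spark.sql("SELECT ckdate('2022-11-11')").show()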

Creating a Dataset in spark-shell

Building a DataFrame in spark-shell to test expressions

import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.expr
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

val structFields = new ArrayBuffer[StructField]()
structFields += StructField("topic", StringType, true)
structFields += StructField("partition", IntegerType, true)
structFields += StructField("offset", LongType, true)

val row1 = Row("a", 3, 45L)
val row2 = Row("b", 5, 55L)
val lrow = List(row1, row2)
val df = spark.createDataFrame(spark.sparkContext.parallelize(lrow), StructType(structFields))

df.select(expr("3*2").as("nub")).show
df.select(expr("date_add('2022-11-11', 3)").as("express")).show
df.select(expr("date_add(current_date(), 3)").as("express")).show
df.select(expr("round(sqrt(partition),4)").as("express"), df("partition")).show


All of the above is fairly involved; here is a simpler way:
val ds = Seq("lisi", "wangwu").toDS()
The column name defaults to value.
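
To give the column a real name, a small follow-up sketch:

val named = Seq("lisi", "wangwu").toDS().toDF("name")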

IDEA notes

JVM debugging
-Xdebug -Xrunjdwp:transport=dt_socket,suspend=n,server=y,address=10000
For debugging Spark, set this before invoking spark-submit.sh:
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=43999


For remote jconsole connections, add this to the JVM startup command:
-Dcom.sun.management.jmxremote  -Dcom.sun.management.jmxremote.port=8999 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

Speeding up GitHub file downloads
https://tool.mintimate.cn/gh/  

Redis notes

redis-cli -p 57084 -a Cdsf@119

MySQL notes

mysql -uroot --port=3306 -p123456 -h10.x.x.92  -e "show create database DATEE_RS;"

Linux notes

Replace multiple values in a file:
sed -i  -e "s/ip1/ip2/g" -e "s/ip3/ip4/g" xxx.properties
find . -type f -name '*.html' | xargs sed -i  -e "s/ip5/ip6/g" -e "s/ip7/ip8/g"

Empty a directory while keeping the files matched by an exclusion:
ls | grep -v "zip" | xargs rm -rf

Alternatives to telnet for port checks:
curl http://10.x.x.200:3306
or
ssh -vp 3306 10.x.x.204

A replacement for the eval command:
echo ${cmd}|sh &> ${log_dir}/server-app.log &

Force-sync the server clock:
ntpdate -u server-ip

Exchanging files with Windows: rz receives a file from Windows; sz xx.txt sends a file to Windows.
yum install -y lrzsz  

Copy the files matched by a query:
ls |grep -v "zip" |xargs -i{} cp -rp {} /home/server-app/plugins/

List the files in an archive:
unzip -l xxx.zip
Add a file to an archive:
zip -u xxx.zip abc.jar


Elasticsearch notes

Accessing ES on Huawei:
source /opt/fi-client/bigdata-env
kinit -kt user.keytab mdev
curl -XGET --tlsv1.2 --negotiate -k -u : 'https://10.x.x.135:24100/website/_search?pretty'

https://elasticstack.blog.csdn.net/   the official ES blog (Chinese)

If ES reports the peer dropping the connection, enable keep-alive on the ES client and also set the system keep-alive interval:
1. httpClientBuilder.setDefaultIOReactorConfig(IOReactorConfig.custom().setSoKeepAlive(true).build())
2. sudo sysctl -w net.ipv4.tcp_keepalive_time=300         then sysctl -p
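
A minimal Scala sketch of where that client setting lives, using the low-level REST client (host and port are placeholders; assumes the elasticsearch rest-client dependency):

import org.apache.http.HttpHost
import org.apache.http.impl.nio.reactor.IOReactorConfig
import org.elasticsearch.client.RestClient

val client = RestClient.builder(new HttpHost("10.x.x.135", 9200, "http"))
  .setHttpClientConfigCallback(httpClientBuilder =>
    // keep the underlying TCP connection alive so idle connections are not dropped
    httpClientBuilder.setDefaultIOReactorConfig(
      IOReactorConfig.custom().setSoKeepAlive(true).build()))
  .build()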

Maven: embedding commit info when packaging

<plugin>
                <groupId>pl.project13.maven</groupId>
                <artifactId>git-commit-id-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>revision</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <dateFormat>yyyy-MM-dd HH:mm:ss</dateFormat>
                    <generateGitPropertiesFile>true</generateGitPropertiesFile>
                    <generateGitPropertiesFilename>${project.build.directory}/wode/public/git-commit.properties</generateGitPropertiesFilename>
                    <format>properties</format>
                    <includeOnlyProperties>
                        <property>git.remote.origin.url</property>
                        <property>git.branch</property>
                        <property>git.commit.id</property>
                        <property>git.commit.time</property>
                    </includeOnlyProperties>
                </configuration>
            </plugin>  
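
At runtime the generated properties can be read back to report the build's commit; a minimal sketch, assuming the file ends up on the classpath as /git-commit.properties (with the custom target path configured above, adjust the location accordingly):

val in = getClass.getResourceAsStream("/git-commit.properties")
if (in != null) {
  val props = new java.util.Properties()
  props.load(in)
  println(props.getProperty("git.commit.id"))
  in.close()
}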


- Deploying a snapshot artifact
mvn deploy:deploy-file -Dfile=inceptor-service-8.8.1.jar -DgroupId=com.transwarp -DartifactId=inceptor-service -Dversion=8.8.1-SNAPSHOT -Dpackaging=jar -Durl=http://x.0.x.78/repository/dev-snapshots -DrepositoryId=dev-snapshots

Jars of the versions used by Huawei MRS:
https://repo.huaweicloud.com/repository/maven/huaweicloudsdk


Firewall ports

firewall-cmd --zone=public --add-port=80/tcp --permanent
## (--permanent makes the rule persistent; without it the rule is lost after a restart.
##  Permanent rules take effect in the running config after firewall-cmd --reload.)
## Close a port:
firewall-cmd --zone=public --remove-port=80/tcp --permanent

Maven: changing the version number everywhere

Batch-replace the version of all modules:
mvn versions:set -DnewVersion=1.0.0-SNAPSHOT
mvn versions:commit

Alternatively, use a property:
<properties>
    <revision>0.3.7-XTY-SNAPSHOT</revision>
</properties>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>flatten-maven-plugin</artifactId>
                <version>1.1.0</version>
                <configuration>
                    <updatePomFile>true</updatePomFile>
                    <flattenMode>resolveCiFriendliesOnly</flattenMode>
                </configuration>
                <executions>
                    <execution>
                        <id>flatten</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>flatten</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>flatten.clean</id>
                        <phase>clean</phase>
                        <goals>
                            <goal>clean</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>


Maven: referencing local jars when shading

As everyone knows, coordinates marked <scope>system</scope> point at local files, and there is more than one way to get them shaded at package time.
Here is a fairly simple approach; the pom is below:
  <dependencies>
            <dependency>
            <groupId>${project.groupId}</groupId>
            <artifactId>${artifactId}-kingbase8-8.6.0.jar</artifactId>
            <version>${version}</version>
        </dependency>
    </dependencies>
    The kingbase file above lives inside my project; it was never uploaded to any repository.
<build>
        <plugins>
             <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                            <artifactSet>
                                <includes>
                                    <include>${groupId}:${artifactId}-kingbase8-8.6.0.jar</include>
                                </includes>
                            </artifactSet>
                            <relocations>
                                <relocation>
                                    <pattern>com.kingbase8</pattern>
                                    <shadedPattern>uniq.com.kingbase8r6</shadedPattern>
                                </relocation>
                            </relocations>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/maven/**</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

# The shade plugin above simply renames the kingbase package to avoid class conflicts.
# The real point is addjars-maven-plugin below: it installs every file under
# ${basedir}/lib into the local repository, named following the pattern used in the
# dependency above, with no need for systemPath and friends.
             <plugin>
                <groupId>com.googlecode.addjars-maven-plugin</groupId>
                <artifactId>addjars-maven-plugin</artifactId>
                <version>1.0.5</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>add-jars</goal>
                        </goals>
                        <configuration>
                            <resources>
                                <resource>
                                    <directory>${basedir}/lib</directory>
                                </resource>
                            </resources>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
</build>

Finding a file inside jars

find ./ -name "*.jar"  -print | xargs grep "JaasContext"

kafka

Start zookeeper:  bin/zkServer.sh start
Start kafka:  bin/kafka-server-start.sh -daemon config/server.properties

Silencing noisy Kafka logging in spark/conf/log4j.properties:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.kafka.clients.consumer.internals.SubscriptionState = ERROR

Produce:
bin/kafka-console-producer.sh --bootstrap-server x.x.x.x:9092 --topic cdctest_new
Consume:
bin/kafka-console-consumer.sh  --bootstrap-server  x.x.x.x:9092 --from-beginning --topic cdctest_new
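
To consume the same topic from Spark structured streaming, a minimal sketch (assumes the spark-sql-kafka package is on the classpath, e.g. the jar added to the MRS archive earlier):

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "x.x.x.x:9092")
  .option("subscribe", "cdctest_new")
  .load()
kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream.format("console").start()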

If Kafka on CDH cannot be reached, configure advertised.listeners=PLAINTEXT://node01:9092 under kafka.brokers in the management console.