Quickly migrating data from HDFS into ClickHouse with Waterdrop (standalone)

Start Waterdrop

./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/streaming.conf

Note: the 4 in local[4] is the number of local threads and is up to you; here I use 4 threads. The config file is also your own choice; the command above runs a streaming job, hence streaming.conf.
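The rest of this post uses batch mode rather than streaming, so the launch command looks the same except for the config file. A minimal sketch, assuming the batch config described below sits at ./config/batch.conf:

./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/batch.conf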

One thing to note up front: Waterdrop has requirements on the Spark version. The Spark 2.1.1 I was using before threw the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
	at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:5
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
	at io.github.interestinglab.waterdrop.input.batch.File.fileReader(File.scala:79)
	at io.github.interestinglab.waterdrop.input.batch.Hdfs.getDataset(Hdfs.scala:12)
	at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$registerTempView$1.apply(Waterdrop.s
	at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$registerTempView$1.apply(Waterdrop.s
	at scala.collection.immutable.List.foreach(List.scala:381)
	at io.github.interestinglab.waterdrop.Waterdrop$.registerTempView(Waterdrop.scala:248)
	at io.github.interestinglab.waterdrop.Waterdrop$.batchProcessing(Waterdrop.scala:185)
	at io.github.interestinglab.waterdrop.Waterdrop$.io$github$interestinglab$waterdrop$Waterdrop
	at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$1.apply$mcV$sp(Waterdrop.scala:35)
	at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$1.apply(Waterdrop.scala:35)
	at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$1.apply(Waterdrop.scala:35)
	at scala.util.Try$.apply(Try.scala:192)
	at io.github.interestinglab.waterdrop.Waterdrop$.main(Waterdrop.scala:35)
	at io.github.interestinglab.waterdrop.Waterdrop.main(Waterdrop.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSub
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 53 more

I later chatted with the Waterdrop developer, who said it is best to use Spark 2.3 or above, since some classes simply do not exist before 2.3, which is why I hit this error. After switching to Spark 2.3.3 the problem was indeed gone.
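Before launching, it is worth confirming which Spark version the Waterdrop launcher will pick up. A quick check with the standard Spark CLI (assuming spark-submit is on your PATH):

spark-submit --version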

With Waterdrop, data migration becomes very simple, and very fast too; after all, the plugins are written in Scala. All you need to do is edit one config file, batch.conf, found under /waterdrop-1.3.8/config, which is meant for batch processing. Waterdrop is considerately designed here: streaming and batch each get their own config file, and you pick whichever you need. Here I use the batch config; the full file is pasted below:

######
###### This config file is a demonstration of batch processing in waterdrop config
######
###### A waterdrop config has four parts: spark, input, filter and output. Waterdrop runs on
###### Spark, so Spark itself needs to be configured first. The other three follow the classic
###### input -> middleware -> output pattern common in big data: input is the data source
###### (HDFS here), output is the sink (ClickHouse here), and filter is the transformation
###### stage, for which waterdrop provides a large number of plugins.
######

spark {
  # You can set spark configuration here
  # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
  spark.app.name = "Waterdrop"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

input {
  # This is an example input plugin **only for testing and demonstrating the input plugin feature**
  # fake {
  #   table_name = "my_dataset"
  # }

  ###### waterdrop provides many input sources; here we use one of them: hdfs.
  ######
  hdfs {
    table_name = "bi_corps_logs3"
    ## path of the source data
    path = "hdfs://mr1/data/int/logs/2019-07-21/2/bi_corps_logs3"
    ## data format; waterdrop supports many formats, and since this data happens to be plain text:
    format = "text"
  }

  # You can also use other input plugins, such as hdfs
  # hdfs {
  #   table_name = "accesslog"
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
  #   format = "json"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of input plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
####### The filter deserves some explanation: if you leave it out, the job fails. I assumed my
####### data did not need any processing since we did not need data cleansing, but I was very
####### wrong. The developer explained that without a filter, waterdrop has no idea what fields
####### the input contains, so it cannot simply line the input up with the output; it is not the
####### simple "filtering" I had imagined. Besides, our data is not a table to begin with: on
####### HDFS it is just comma-separated lines, which is why the split filter below is used.
####### Adjust it to your own needs. Without it, the ClickHouse table itself is unaffected,
####### but it ends up with no data in it.
######
filter {
  # split data by a specific delimiter
  split {
    fields = ["corps_id", "corps_name", "old_owner_id", "create_time", "new_owner_id", "request_time", "role_vip", "count", "role_level", "corps_level"]
    delimiter = ", "
  }
  # you can also use other filter plugins, such as sql
  # sql {
  #   table_name = "accesslog"
  #   sql = "select * from accesslog where request_time > 1000"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of filter plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}

output {
  # choose stdout output plugin to output data to console
  # stdout {
  # }

  clickhouse {
    host = "192.168.12.129:8123"
    database = "waterdrop"
    table = "bi_corps_logs3"
    fields = ["corps_id", "corps_name", "old_owner_id", "create_time", "new_owner_id", "request_time", "role_vip", "count", "role_level", "corps_level"]
    # username = "username"
    # password = "password"
  }

  # you can also use other output plugins, such as hdfs
  # hdfs {
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
  #   save_mode = "append"
  # }

  # If you would like to get more information about how to configure waterdrop and see full list of output plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
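To make the split filter above concrete, here is a sketch of how one raw HDFS line is assumed to look and how it maps onto the ten fields; the sample values are hypothetical, and only the field order and the ", " delimiter come from the config:

# one raw line from hdfs://mr1/data/int/logs/2019-07-21/2/bi_corps_logs3 (made-up values)
10001, FireCorps, 2001, 2019-07-21 10:00:00, 2002, 2019-07-21 12:30:00, 3, 15, 42, 5

# after the split filter it becomes one row whose columns are, in order:
# corps_id, corps_name, old_owner_id, create_time, new_owner_id, request_time, role_vip, count, role_level, corps_level

This is exactly the field list the clickhouse output plugin writes, which is why the two fields arrays match.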

Run result without the filter configured:

Run result after configuring the filter:

In other words, the HDFS data has been successfully imported into ClickHouse.

Of course, before running the job you have to create the database and table in ClickHouse first; otherwise, where would the data be migrated to?

create database waterdrop;
use waterdrop;
CREATE TABLE bi_corps_logs3 (
    corps_id String, corps_name String, old_owner_id String, create_time String,
    new_owner_id String, request_time String, role_vip String, count String,
    role_level String, corps_level String
) ENGINE = Memory;
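After the Waterdrop job finishes, a quick query in clickhouse-client verifies that the rows actually arrived (a minimal sketch; the expected count depends on your source file):

SELECT count() FROM waterdrop.bi_corps_logs3;
SELECT * FROM waterdrop.bi_corps_logs3 LIMIT 5;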

Waterdrop supports many data sources; here I only tested HDFS.

References:

[1] https://blog.csdn.net/huochen1994/article/details/83827587 (written by the Waterdrop developer; very detailed)

[2] https://interestinglab.github.io/waterdrop/#/zh-cn/ (the official Waterdrop documentation)

[3] https://github.com/InterestingLab/waterdrop (the Waterdrop developer's GitHub)
