Start waterdrop:
./bin/start-waterdrop.sh --master local[4] --deploy-mode client --config ./config/streaming.conf
Note: the 4 in local[4] is the number of local threads; you choose it yourself, and here it is 4. The config file is also your own choice; the command above runs a streaming job, hence streaming.conf.
One thing to point out upfront: waterdrop is demanding about the Spark version. The 2.11 I used previously failed with:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/StreamWriteSupport
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:5
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at io.github.interestinglab.waterdrop.input.batch.File.fileReader(File.scala:79)
at io.github.interestinglab.waterdrop.input.batch.Hdfs.getDataset(Hdfs.scala:12)
at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$registerTempView$1.apply(Waterdrop.s
at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$registerTempView$1.apply(Waterdrop.s
at scala.collection.immutable.List.foreach(List.scala:381)
at io.github.interestinglab.waterdrop.Waterdrop$.registerTempView(Waterdrop.scala:248)
at io.github.interestinglab.waterdrop.Waterdrop$.batchProcessing(Waterdrop.scala:185)
at io.github.interestinglab.waterdrop.Waterdrop$.io$github$interestinglab$waterdrop$Waterdrop
at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$1.apply$mcV$sp(Waterdrop.scala:35)
at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$1.apply(Waterdrop.scala:35)
at io.github.interestinglab.waterdrop.Waterdrop$$anonfun$1.apply(Waterdrop.scala:35)
at scala.util.Try$.apply(Try.scala:192)
at io.github.interestinglab.waterdrop.Waterdrop$.main(Waterdrop.scala:35)
at io.github.interestinglab.waterdrop.Waterdrop.main(Waterdrop.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSub
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 53 more
I later talked with the waterdrop developer, who said it is best to use Spark 2.3 or above, because some classes simply do not exist in earlier versions; that is why I got the error. After switching to 2.3.3, the problem was indeed solved.
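The version requirement comes down to a numeric comparison of dotted version strings. A minimal sketch of such a check (the helper name is mine, not part of waterdrop or Spark):

```python
def version_at_least(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically (hypothetical helper,
    not a waterdrop API). Missing components count as 0, so "2.3" == "2.3.0"."""
    parse = lambda v: [int(p) for p in v.split(".")]
    a, b = parse(version), parse(minimum)
    # pad the shorter list with zeros before comparing element-wise
    n = max(len(a), len(b))
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    return a >= b

print(version_at_least("2.3.3", "2.3"))   # Spark 2.3.3 satisfies the requirement
print(version_at_least("2.2.0", "2.3"))   # pre-2.3 lacks StreamWriteSupport
```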
With waterdrop, data migration becomes simple and fast; it is, after all, a plugin framework written in Scala. All you need to change is one config file: batch.conf, under /waterdrop-1.3.8/config, which handles batch processing. waterdrop is conveniently designed so that streaming and batch are split into two separate config files; pick whichever fits your needs. Here I chose the batch config. Here is the file:
######
###### This config file is a demonstration of batch processing in waterdrop config
######
###### A waterdrop config has four blocks: spark, input, filter and output. waterdrop
###### runs its computation on Spark, so Spark itself needs some configuration. The
###### other three form the classic input -> middleware -> output pipeline common in
###### big data. input is the data source (HDFS here), output is the sink (ClickHouse
###### here), and filter transforms the data in between; waterdrop ships a great many
###### filter plugins.
######
spark {
  # You can set spark configuration here
  # see available properties defined by spark: https://spark.apache.org/docs/latest/configuration.html#available-properties
  spark.app.name = "Waterdrop"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}
input {
  # This is an example input plugin **only for testing and demonstrating the input plugin feature**
  # fake {
  #   table_name = "my_dataset"
  # }

  ###### waterdrop offers many input sources; here we use one of them: hdfs.
  ######
  hdfs {
    table_name = "bi_corps_logs3"
    ## source data path
    path = "hdfs://mr1/data/int/logs/2019-07-21/2/bi_corps_logs3"
    ## data format; waterdrop supports many formats, and this data happens to be
    ## plain text, hence "text"
    format = "text"
  }

  # You can also use other input plugins, such as hdfs
  # hdfs {
  #   table_name = "accesslog"
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog"
  #   format = "json"
  # }

  # If you would like to get more information about how to configure waterdrop and see the full list of input plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
###### The filter deserves some explanation, because leaving it out causes an error.
###### I assumed my data needed no processing since we do no data cleaning; I was very
###### wrong. As the developer explained to me, without a filter waterdrop has no idea
###### which fields the input contains, so it cannot simply line them up with the
###### output. It is not just "filtering", as I had imagined. Besides, our data is not
###### a table to begin with; on HDFS it is just comma-separated lines, which is why I
###### use the split filter here. Adjust it to your own needs, of course. Omitting it
###### does no harm to the ClickHouse table itself, but the table ends up empty.
######
filter {
  # split data by a specific delimiter
  split {
    fields = ["corps_id", "corps_name", "old_owner_id", "create_time", "new_owner_id", "request_time", "role_vip", "count", "role_level", "corps_level"]
    delimiter = ", "
  }

  # you can also use other filter plugins, such as sql
  # sql {
  #   table_name = "accesslog"
  #   sql = "select * from accesslog where request_time > 1000"
  # }

  # If you would like to get more information about how to configure waterdrop and see the full list of filter plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
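To make the split filter's job concrete, here is a small Python sketch of what it conceptually does with one input line. The field names and delimiter mirror the config above; the sample line is made up, and the real plugin of course runs as Spark transformations, not plain Python:

```python
# Conceptual model of waterdrop's split filter: take a raw input line,
# split it on the configured delimiter, and pair the pieces with the
# configured field names. (Illustration only; not the actual plugin code.)
fields = ["corps_id", "corps_name", "old_owner_id", "create_time",
          "new_owner_id", "request_time", "role_vip", "count",
          "role_level", "corps_level"]
delimiter = ", "

# a made-up sample line in the same comma-separated layout as the HDFS data
raw_message = "1001, alpha, 7, 2019-07-21 12:00:00, 9, 2019-07-21 12:01:00, 3, 42, 55, 6"

row = dict(zip(fields, raw_message.split(delimiter)))
print(row["corps_name"])  # alpha
print(row["count"])       # 42
```

This is also why the job fails without a filter: nothing else tells waterdrop how a raw text line maps onto the ten named columns that the output expects.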
output {
  # choose the stdout output plugin to print data to the console
  # stdout {
  # }

  clickhouse {
    host = "192.168.12.129:8123"
    database = "waterdrop"
    table = "bi_corps_logs3"
    fields = ["corps_id", "corps_name", "old_owner_id", "create_time", "new_owner_id", "request_time", "role_vip", "count", "role_level", "corps_level"]
    # username = "username"
    # password = "password"
  }

  # you can also use other output plugins, such as hdfs
  # hdfs {
  #   path = "hdfs://hadoop-cluster-01/nginx/accesslog_processed"
  #   save_mode = "append"
  # }

  # If you would like to get more information about how to configure waterdrop and see the full list of output plugins,
  # please go to https://interestinglab.github.io/waterdrop/#/zh-cn/configuration/base
}
The run result without the filter:
The run result with the filter:
In other words, we have successfully imported the HDFS data into ClickHouse.
Of course, before running the job you need to create the database and table in ClickHouse first; otherwise, where would the data be migrated to?
create database waterdrop;
use waterdrop;
CREATE TABLE bi_corps_logs3 (
    corps_id String,
    corps_name String,
    old_owner_id String,
    create_time String,
    new_owner_id String,
    request_time String,
    role_vip String,
    count String,
    role_level String,
    corps_level String
) ENGINE = Memory;
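After the job finishes, you can sanity-check the row count over ClickHouse's HTTP interface on port 8123, the same endpoint the clickhouse output plugin talks to. A minimal sketch with Python's standard library; the host, database, and table come from the config above, and the actual request is commented out so the script also runs without a live server:

```python
import urllib.parse
import urllib.request

def count_query_url(host: str, database: str, table: str) -> str:
    """Build a ClickHouse HTTP URL that runs a simple row-count query."""
    query = f"SELECT count() FROM {database}.{table}"
    return f"http://{host}/?{urllib.parse.urlencode({'query': query})}"

url = count_query_url("192.168.12.129:8123", "waterdrop", "bi_corps_logs3")
print(url)

# Uncomment when a ClickHouse server is actually reachable at that host:
# with urllib.request.urlopen(url, timeout=5) as resp:
#     print("rows:", resp.read().decode().strip())
```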
waterdrop supports many data sources; here I only tested HDFS.
References:
[1] https://blog.csdn.net/huochen1994/article/details/83827587 (written by the waterdrop developer; detailed)
[2] https://interestinglab.github.io/waterdrop/#/zh-cn/ (official waterdrop documentation)
[3] https://github.com/InterestingLab/waterdrop (the waterdrop developer's GitHub)