Kafka in practice: reading data and implementing a "row-to-column" transformation in Java
- 1. Business requirement
- 2. Implementation
- 2.1 Create the topic event_attendees_raw in Kafka
- 2.2 Create the Flume configuration file for reading the data
- 2.3 Run the Flume configuration from the Flume root directory
- 2.4 Copy event_attendees.csv into the directory Flume watches
- 2.5 Check the message count of topic event_attendees_raw
- 2.6 Create a new topic, event_attendees, to hold the row-to-column data
- 2.7 Implement the row-to-column logic in Java
- 2.8 Check the topic event_attendees message queue
- 2.9 Create a consumer to inspect the transformed data
- 2.10 Store the data from Kafka topic event_attendees into HBase
1. Business requirement
We have a data file, event_attendees.csv. A sample of its first few rows is shown below.
Header row:
event | yes | maybe | invited | no
---|---|---|---|---
The columns mean:
- event: the ID of an event (a gathering); the other four columns list the invited users by attendance status
- yes: will attend
- maybe: may attend
- invited: was invited
- no: will not attend
Requirement: pair the event of each row with every user ID in its status columns, append the corresponding status label, and write the resulting rows into Kafka.
1159822043,1975964455 252302513 4226086795 3805886383 1420484491 3831921392 3973364512,2733420590 517546982 1350834692 532087573 583146976 3079807774 1324909047,1723091036 3795873583 4109144917 3560622906 3106484834 2925436522 2284506787 2484438140 3148037960 2142928184 1682878505 486528429 3474278726 2108616219 3589560411 3637870501 1240238615 1317109108 1225824766 2934840191 2245748965 4059548655 1646990930 2361664293 3134324567 2976828530 766986159 1903653283 3090522859 827508055 140395236 2179473237 1316219101 910840851 1177300918 90902339 4099434853 2056657287 717285491 3384129768 4102613628 681694749 3536183215 1017072761 1059775837 1184903017 434306588 903024682 1971107587 3461437762 196870175 2831104766 766089257 2264643432 2868116197 25717625 595482504 985448353 4089810567 1590796286 3920433273 1826725698 3845833055 1674430344 2364895843 1127212779 481590583 1262260593 899673047 4193404875,3575574655 1077296663
The line above is the second row of the file. After splitting, the data should look like this:
1159822043,1975964455,yes
1159822043,252302513,yes
1159822043,4226086795,yes
...
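The split logic can be sketched as a standalone Java method, with no Kafka required. The class and method names here (`RowToColumnSketch`, `explode`) are illustrative, not part of the project code:

```java
import java.util.ArrayList;
import java.util.List;

public class RowToColumnSketch {
    // Status labels matching the four columns after "event"
    static final String[] STATUSES = {"yes", "maybe", "invited", "no"};

    // Split one raw line "event,yes-list,maybe-list,invited-list,no-list"
    // into "event,user,status" triples.
    static List<String> explode(String line) {
        List<String> out = new ArrayList<>();
        String[] cols = line.split(",", -1);   // -1 keeps empty trailing columns
        if (cols.length != 5) return out;      // skip malformed rows
        String event = cols[0];
        for (int i = 1; i < 5; i++) {
            for (String user : cols[i].split(" ")) {
                if (!user.isEmpty()) {
                    out.add(event + "," + user + "," + STATUSES[i - 1]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A shortened version of the second row above; prints one triple per user
        explode("1159822043,1975964455 252302513,2733420590,1723091036,3575574655")
                .forEach(System.out::println);
    }
}
```

Splitting with `split(",", -1)` matters: a row whose `no` column is empty would otherwise lose its trailing field and be rejected by the length check.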
Download link for event_attendees.csv: https://pan.baidu.com/s/1pY8xz0BBMMLhRbo38PXe8Q
Extraction code: euhn
2. Implementation
2.1 Create the topic event_attendees_raw in Kafka
From the Kafka root directory, create the topic event_attendees_raw, which will hold the raw event_attendees.csv data read by Flume:
bin/kafka-topics.sh --create --zookeeper 192.168.206.129:2181 --topic event_attendees_raw --partitions 1 --replication-factor 1
2.2 Create the Flume configuration file for reading the data
vi event_attendees-flume-kafka.conf
Edit it with the following content:
event_attendees.sources = eventAttendeesSource
event_attendees.channels = eventAttendeesChannel
event_attendees.sinks = eventAttendeesSink
event_attendees.sources.eventAttendeesSource.type = spooldir
event_attendees.sources.eventAttendeesSource.spoolDir = /opt/dataFile/flumeFile/event_attendees
event_attendees.sources.eventAttendeesSource.deserializer = LINE
event_attendees.sources.eventAttendeesSource.deserializer.maxLineLength = 60000
event_attendees.sources.eventAttendeesSource.includePattern = event_attendees_[0-9]{4}-[0-9]{2}-[0-9]{2}\.csv
event_attendees.channels.eventAttendeesChannel.type = file
event_attendees.channels.eventAttendeesChannel.checkpointDir = /opt/dataFile/flumeFile/checkpoint/event_attendees
event_attendees.channels.eventAttendeesChannel.dataDir = /opt/dataFile/flumeFile/data/event_attendees
event_attendees.sinks.eventAttendeesSink.type = org.apache.flume.sink.kafka.KafkaSink
event_attendees.sinks.eventAttendeesSink.batchSize = 640
event_attendees.sinks.eventAttendeesSink.brokerList = 192.168.206.129:9092
event_attendees.sinks.eventAttendeesSink.topic = event_attendees_raw
event_attendees.sources.eventAttendeesSource.channels = eventAttendeesChannel
event_attendees.sinks.eventAttendeesSink.channel = eventAttendeesChannel
2.3 Run the Flume agent with this configuration from the Flume root directory
./bin/flume-ng agent --name event_attendees --conf conf/ --conf-file conf/job/event_attendees-flume-kafka.conf -Dflume.root.logger=INFO,console
2.4 Copy event_attendees.csv into the spooling directory Flume watches
install event_attendees.csv /opt/dataFile/flumeFile/event_attendees/event_attendees_2020-08-24.csv
If everything above is configured correctly, Flume will start reading the data.
2.5 Check the message count of topic event_attendees_raw
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list hadoop-single:9092 --topic event_attendees_raw -time -1 -offsets 1
Then count the total number of rows in event_attendees.csv:
wc -l event_attendees.csv
In my case the two counts differ slightly, which can be ignored. A large discrepancy, however, would mean the data cannot be trusted.
2.6 Create a new topic, event_attendees, to hold the row-to-column data
bin/kafka-topics.sh --create --zookeeper 192.168.206.129:2181 --topic event_attendees --partitions 1 --replication-factor 1
2.7 Implement the row-to-column logic in Java
Maven dependencies to add:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
Create a Java class MyEventAttendees with the following business logic:
package cn.kgc.events;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.KStream;

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;

public class MyEventAttendees {
    public static void main(String[] args) {
        Properties prop = new Properties();
        // broker address
        prop.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.206.129:9092");
        // application ID: any unique string; change it between runs,
        // otherwise cached state from a previous run may cause errors
        prop.put(StreamsConfig.APPLICATION_ID_CONFIG, "myEventAttendees");
        // default key/value serdes
        prop.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        prop.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // drop the header row "event,yes,maybe,invited,no"
        // and any row that does not have exactly 5 columns
        final KStream<Object, Object> event_attendees_raw =
                builder.stream("event_attendees_raw")
                        .filter((k, v) -> (
                                !v.toString().startsWith("event,") &&
                                v.toString().split(",").length == 5
                        ));
        event_attendees_raw.flatMap((k, v) -> {  // say the 5 fields are 1,2,3,4,5
            System.out.println(k + " " + v);     // prints: null 1,2,3,4,5
            List<KeyValue<String, String>> keyValues = new ArrayList<>();
            // split each row into its 5 fields
            String[] split = v.toString().split(",");
            String event = split[0];                 // event
            String[] yess = split[1].split(" ");     // yes
            String[] maybes = split[2].split(" ");   // maybe
            String[] inviteds = split[3].split(" "); // invited
            String[] nos = split[4].split(" ");      // no
            // expand each of the four status lists, collecting
            // one "event,user,status" record per user into keyValues
            for (String yes : yess) {
                keyValues.add(new KeyValue<>(null, event + "," + yes + ",yes"));
            }
            for (String maybe : maybes) {
                keyValues.add(new KeyValue<>(null, event + "," + maybe + ",maybe"));
            }
            for (String invited : inviteds) {
                keyValues.add(new KeyValue<>(null, event + "," + invited + ",invited"));
            }
            for (String no : nos) {
                keyValues.add(new KeyValue<>(null, event + "," + no + ",no"));
            }
            return keyValues;  // return the expanded records
        }).to("event_attendees");  // write the transformed data to topic "event_attendees"

        // build the topology
        Topology topo = builder.build();
        KafkaStreams streams = new KafkaStreams(topo, prop);
        CountDownLatch countDownLatch = new CountDownLatch(1);
        // close the streams app cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread("test01") {
            @Override
            public void run() {
                streams.close();
                countDownLatch.countDown();
            }
        });
        try {
            streams.start();
            countDownLatch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.exit(0);  // exit once the shutdown hook releases the latch
    }
}
2.8 Check the topic event_attendees message queue
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list hadoop-single:9092 --topic event_attendees -time -1 -offsets 1
If the command reports the expected offsets, the data was stored successfully.
2.9 Create a consumer to inspect the transformed data
kafka-console-consumer.sh --bootstrap-server 192.168.206.129:9092 --topic event_attendees --from-beginning
If the consumed records have the event,user,status structure shown earlier, the processing succeeded (the original post included a partial screenshot here).
2.10 Store the data from Kafka topic "event_attendees" into HBase
① Create a namespace in HBase:
create_namespace 'events_db'
② Create a table in that namespace:
create 'events_db:event_attendee', 'euat'
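The import code below uses the hash code of event_id concatenated with user_id as the rowkey. That scheme can be sketched without an HBase dependency; `RowKeySketch` and `rowKeyFor` are hypothetical names, and the 4-byte big-endian layout mirrors what `Bytes.toBytes(int)` produces:

```java
import java.nio.ByteBuffer;

public class RowKeySketch {
    // Build a 4-byte rowkey from the int hash of event_id + user_id,
    // mirroring Bytes.toBytes((event + user).hashCode()) in the HBase client code.
    static byte[] rowKeyFor(String eventId, String userId) {
        return ByteBuffer.allocate(4).putInt((eventId + userId).hashCode()).array();
    }
}
```

Note that hashing spreads rows evenly across regions but loses scan locality by event, and distinct (event, user) pairs can in principle collide on the same hash; a production design might prefix the hash to a readable `event_user` key instead.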
③ Import the Kafka data into HBase via the Java API:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

/**
 * @ClassName: EventAttendeeshb
 * @Description: TODO import the data of Kafka topic event_attendees into
 *               the HBase table events_db:event_attendee
 * TODO create the table in HBase first:
 *     create_namespace 'events_db'
 *     create 'events_db:event_attendee', 'euat'
 * event yes maybe invited no
 *   0    1    2      3    4
 * rowkey: calculated hash code of event_id + user_id
 * euat: event_id, user_id, attend_type
 * @author: 我玩的很开心
 * @date: 2020/9/7 15:15
 */
public class EventAttendeeshb {
    public static void main(String[] args) {
        // Kafka consumer configuration
        Properties prop = new Properties();
        prop.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.206.129:9092");
        prop.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        prop.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        prop.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
        prop.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        prop.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        prop.put(ConsumerConfig.GROUP_ID_CONFIG, "users123456");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(prop);
        consumer.subscribe(Collections.singletonList("event_attendees"));

        // HBase connection configuration
        Configuration config = HBaseConfiguration.create();
        config.set("hbase.rootdir", "hdfs://192.168.206.129:9000/hbase");
        config.set("hbase.zookeeper.quorum", "192.168.206.129");
        config.set("hbase.zookeeper.property.clientPort", "2181");
        try {
            Connection connection = ConnectionFactory.createConnection(config);
            Table table = connection.getTable(TableName.valueOf("events_db:event_attendee"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                List<Put> putList = new ArrayList<>();
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record);
                    // split with limit -1 so empty fields are preserved;
                    // the resulting array never contains null elements
                    String[] infos = record.value().split(",", -1);
                    if (infos.length == 3) {
                        // rowkey: hash code of event_id + user_id
                        Put put = new Put(Bytes.toBytes((infos[0] + infos[1]).hashCode()));
                        put.addColumn("euat".getBytes(), "event_id".getBytes(), infos[0].getBytes());
                        put.addColumn("euat".getBytes(), "user_id".getBytes(), infos[1].getBytes());
                        put.addColumn("euat".getBytes(), "attend_type".getBytes(), infos[2].getBytes());
                        putList.add(put);
                    }
                }
                table.put(putList);
                // do NOT close the table inside the loop: it is reused on
                // every poll. Commit offsets manually since auto-commit is off.
                consumer.commitSync();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
④ View the data in the HBase table events_db:event_attendee:
scan 'events_db:event_attendee'