I. Introduction
This step comes before data processing and before the data warehouse is built. Its goal is to understand how the source tables relate to each other and to prepare for the later data cleaning/analysis and model selection. The main points are as follows (a quick profiling sketch is given after the list):
1. Understand the source data's structure, data types, field names, and so on.
2. Check whether the source data contains special values, e.g. empty strings, whitespace, None.
3. Check whether there are time columns, what format they use (Unix timestamps, formatted dates, etc.), and whether they contain special characters such as Z, T, or ++.
4. Check whether any field contains duplicate values, and understand how the duplicated field relates to the other fields.
5. Check whether any fields will need row-to-column or column-to-row transformations during later processing.
6. Check whether a field mixes several representations of the same state, e.g. 是, 否, yes, no, maybe.
7. Record how many rows the original table has excluding the header, and how many remain after a simple cleaning pass, so the total row count can be verified later, whether you are building tables or cleaning data.
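Most of these checks can be scripted. A minimal Spark sketch of such a profile, assuming a DataFrame named df has already been loaded (the column name locale is only an illustration):
import org.apache.spark.sql.functions._

// Total row count (the header is excluded when the CSV is read with header=true)
println(df.count())

// Count null / "None" / blank values in every column
df.columns.foreach { c =>
  val bad = df.filter(col(c).isNull || col(c) === "None" || trim(col(c)) === "").count()
  println(s"$c: $bad suspicious values")
}

// Distinct values of a low-cardinality column, e.g. locale
df.select("locale").distinct().show(false)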
II. Reading the Data
Spark is used here to read the data.
- Imports (implicit conversions and Spark SQL functions)
import spark.implicits._
import org.apache.spark.sql.functions._
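All of the snippets below assume a SparkSession named spark already exists. A minimal sketch of creating one (the application name and master are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("events-data-exploration")  // placeholder application name
  .master("local[*]")                  // replace with your cluster's master URL
  .getOrCreate()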
- Read the users table
val df = spark.read.format("csv")
.option("header", "true")
.load("hdfs://192.168.59.200:9000/events/data/users.csv")
.cache()
38,209 rows
- Read the friends table
val fr = spark.read.format("csv")
.option("header", "true")
.load("hdfs://192.168.59.200:9000/events/data/user_friends.csv")
.cache()
38,202 rows
- Read the events table
val eve = spark.read.format("csv")
.option("header", "true")
.load("hdfs://192.168.59.200:9000/events/data/events.csv")
.cache()
3,137,972 rows
- Read the event attendees table
val ea = spark.read.format("csv")
.option("header", "true")
.load("hdfs://192.168.59.200:9000/events/data/event_attendees.csv")
.cache()
24,144 rows
- Read the train table
val tr = spark.read.format("csv")
.option("header", "true")
.load("hdfs://192.168.59.130:9000/events/data/train.csv")
.cache()
15,398 rows
- Read the test table
val te = spark.read.format("csv")
.option("header", "true")
.load("hdfs://192.168.59.130:9000/events/data/test.csv")
.cache()
10,237 rows
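The row counts listed after each table can be reproduced by calling count() on the cached DataFrames; a minimal sketch using the variable names defined above:
// Print the row count of each table (header rows are excluded by header=true)
Seq(("users", df), ("user_friends", fr), ("events", eve),
    ("event_attendees", ea), ("train", tr), ("test", te))
  .foreach { case (name, d) => println(s"$name: ${d.count()} rows") }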
III. Data Exploration
1. Find the users whose locale is null, join them to the friends table, and find which locale is most common among each user's friends, then use it to fill in the missing value.
// First find the users whose locale is null, then explode their friend lists
import org.apache.spark.sql.expressions.Window

val tmp = df.filter($"locale".isNull).withColumnRenamed("user_id","user")
  .join(fr, Seq("user"), "inner").select("user","friends")
  .withColumn("friendid", explode(split($"friends"," ")))
  .drop("friends")

// For each such user, rank the friends' locales by frequency and keep the most common one
val wnd = Window.partitionBy("user").orderBy(desc("peoplenum"))
val tmp2 = tmp.join(df, $"friendid" === $"user_id", "inner")
  .select("user","locale")
  .groupBy($"user", $"locale")
  .agg(count($"locale").alias("peoplenum"))
  .select($"user", $"locale".alias("loc"), row_number().over(wnd).alias("rank"))
  .filter("rank == 1").cache()

// Fill the missing locale with the most common locale among the user's friends
df.join(tmp2, $"user" === $"user_id", "left")
  .withColumn("localename", coalesce($"locale", $"loc"))
  .drop("locale","loc","user","rank")
  .show()
2. Remove the None values in birthyear, filter to plausible years, and count them.
df.filter("birthyear rlike '19[5-9][0-9]|20[0-1][0-9]|2020'")
.agg(count($"birthyear")).show
// Alternatively, cast to IntegerType: non-numeric values such as "None" become null and are not counted
import org.apache.spark.sql.types.IntegerType
df.withColumn("birthyear", $"birthyear".cast(IntegerType)).agg(count($"birthyear")).show()
// Average the valid birth years, rounded down
val pj = df.filter("birthyear rlike '19[5-9][0-9]|20[0-1][0-9]|2020'")
.agg(floor(avg($"birthyear")).alias("pjyear"))
//UDF
val calAges = udf((age: String, defyear: String) => {
  val reg = "^(19[5-9][0-9]|20[0-1][0-9]|2020)$".r
  // Guard against null, which would make findAllMatchIn throw
  if (age != null && reg.findAllMatchIn(age).hasNext) {
    age
  } else {
    defyear
  }
})
// Replace None in birthyear with the average of the valid years; the replacement is written to a new column
// pjyear is a bigint, so cast it to string before passing it to the string-typed UDF
df.crossJoin(pj).withColumn("byear", calAges($"birthyear", $"pjyear".cast("string"))).show()
3. Examine the gender column.
df.groupBy("gender").agg(max("gender")).show()
// There are nulls, so replace them with "unknown"
df.withColumn("gender",
when($"gender".isNull,"unknown").otherwise($"gender"))
// The same thing as a UDF
val genderUnique=udf((sex:String)=>{
if(sex == null || sex.equals("None")){
"unknown"
}else if(sex.equals("男") || sex.equals("1")){
"male"
}else{
"female"
}
})
df.withColumn("gender",genderUnique($"gender"))
4. Find the rows where joinedAt is null.
df.filter($"joinedAt".isNull).count()
// Note the capture group in regexp_extract, and that the literal T in the time format must be quoted
df.withColumn("joinedAt", unix_timestamp(
  regexp_extract($"joinedAt", "(.*T.*).[0-9]{3}Z.*", 1), "yyyy-MM-dd'T'HH:mm:ss"))
// Drop the milliseconds and convert the date to "yyyy-MM-dd HH:mm:ss"
val ymd = udf((str:String)=>{
if(str == null || str.equals("None")){
null
}else{
val tms = str.split("[T.]")
tms(0)+" "+tms(1)
}
})
// Convert the formatted dates to timestamps, take the average, convert it back, then replace malformed dates with that average
val pjTime = df.withColumn("joinedAt", unix_timestamp(ymd($"joinedAt")))
  .agg(floor(avg($"joinedAt")).alias("pjtime"))
  .select(from_unixtime($"pjtime").alias("pjtime"))
df.crossJoin(pjTime)
  .withColumn("joinedAt", ymd($"joinedAt"))
  .withColumn("joinedAt",
    when($"joinedAt".isNull, $"pjtime")
      .otherwise($"joinedAt"))
  .drop("pjtime").show(60, false)
5. Look for duplicate events.
println(eve.groupBy($"event_id").agg(count($"event_id")).count())
eve.agg(countDistinct("event_id")).show()
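If you also want to list which event_id values are duplicated (a small extension, not part of the original checks):
// event_ids that appear more than once, with their counts
eve.groupBy($"event_id")
  .agg(count($"event_id").alias("cnt"))
  .filter($"cnt" > 1)
  .show()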
6. Find which host hosts the most events, and how many that is.
eve.groupBy($"user_id")
.agg(count($"user_id").alias("changci"))
.orderBy(desc("changci")).show()
7. Count the user ids that appear in the events table but not in the users table.
println(eve.select("user_id").distinct()
.join(df.select("user_id", "locale").distinct(), Seq("user_id"), "left")
.filter($"locale".isNull).count())
8. One row per user-friend pair (exploding the friends column into rows).
fr.createOrReplaceTempView("user_friends")
spark.sql(
"""
|select user,friend
|from user_friends
|lateral view explode(split(friends,' ')) f as friend
|""".stripMargin).show()
// Also attach each friend id's position index
fr.createOrReplaceTempView("user_friends")
spark.sql(
"""
|select user,friend,index
|from user_friends
|lateral view posexplode(split(friends,' ')) f as index,friend
|""".stripMargin).show()
fr.select($"user".alias("user_id")
,explode(split($"friends"," ")).alias("friend_id"))
.drop("friends")
fr.withColumn("friends",explode(split($"friends"," "))).show()
9. Split the event_attendees table so that each event-user pair is on its own row.
// Spark version
val yes = ea.select($"event".alias("event_id")
  ,explode(split($"yes"," ")).alias("user_id")
  ,lit("yes").alias("act_type"))
val maybe = ea.select($"event".alias("event_id")
  ,explode(split($"maybe"," ")).alias("user_id")
  ,lit("maybe").alias("act_type"))
val invited = ea.select($"event".alias("event_id")
  ,explode(split($"invited"," ")).alias("user_id")
  ,lit("invited").alias("act_type"))
val no = ea.select($"event".alias("event_id")
  ,explode(split($"no"," ")).alias("user_id")
  ,lit("no").alias("act_type"))
println(yes.union(maybe).union(invited).union(no).distinct().count())
-- Hive version
create database interests;
use interests;
create external table ext_attendees(
event string,
yes string,
maybe string,
invited string,
no string
)
row format delimited fields terminated by ','
location '/events/data/attendees'
tblproperties("skip.header.line.count"="1");
-- One row per (event, user, action): explode each list in its own CTE, then union them
with
t1 as (select event, user_id, 'yes' as act_type from ext_attendees
       lateral view explode(split(yes,' ')) y as user_id),
t2 as (select event, user_id, 'maybe' as act_type from ext_attendees
       lateral view explode(split(maybe,' ')) m as user_id),
t3 as (select event, user_id, 'invited' as act_type from ext_attendees
       lateral view explode(split(invited,' ')) i as user_id),
t4 as (select event, user_id, 'no' as act_type from ext_attendees
       lateral view explode(split(no,' ')) n as user_id)
select * from t1 union all select * from t2
union all select * from t3 union all select * from t4;
select count(*) from
(
select distinct event event_id,user_id,'yes' as action
from ext_attendees
lateral view explode(split(yes,' '))a as user_id
union all
select distinct event event_id,user_id,'maybe' as action
from ext_attendees
lateral view explode(split(maybe,' '))a as user_id
union all
select distinct event event_id,user_id,'invited' as action
from ext_attendees
lateral view explode(split(invited,' '))a as user_id
union all
select distinct event event_id,user_id,'no' as action
from ext_attendees
lateral view explode(split(no,' '))a as user_id
)f
where f.user_id is not null;
10. Find each user's most recent event in the train table.
tr.select("user","event").distinct().count()
val wnd = Window.partitionBy($"user", $"event").orderBy(desc("timestamp"))
tr.withColumn("timestramp",invitedTime($"timestramp"))
.select($"user",$"event",$"invited",$"timestamp",$"interested",$"not_interested"
,row_number().over(wnd).alias("rank")).filter("rank==1")
.agg(count("user")).show()
IV. POM File
- Dependencies
<!-- spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.4</version>
</dependency>
<!-- hadoop-client -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
</dependency>
<!-- spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.4</version>
</dependency>
<!-- guava -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>14.0.1</version>
</dependency>