[ Project ] Interests Data Exploration

This article walks through data exploration with Spark: checking data structure, handling special values, analyzing time columns, finding duplicates, and converting data formats. Worked examples cover user locale, birth year, gender, and event attendance, plus dependency management via the POM file.

I. Introduction

  This step comes before data processing and data-warehouse construction. It helps you sort out the relationships among the source tables and prepares for the later data cleaning, analysis, and algorithm/model selection. The main points are as follows (a quick sketch of these checks follows the list):

  1. Understand the source data's structure, data types, field names, and so on.

  2. Check whether the source data contains special values, e.g. spaces or None.

  3. Check whether the source data has a time column, what format it uses (unix timestamp, formatted date, etc.), and whether it contains special characters such as Z, T, or +.

  4. Check whether any field contains duplicate values, and understand how the duplicated field relates to the other fields.

  5. Check whether any fields will need row-to-column or column-to-row transposition during later processing.

  6. Check whether any field holds multi-state values, e.g. yes, no, maybe.

  7. Record how many rows the original table has excluding the header, and how many remain after a simple cleaning, so that the total data volume can be verified later when creating tables or cleaning data.
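
  A minimal sketch of these checks, run here against the users table loaded in the next section (the column set comes from that dataset; the empty-string/"None" sentinels are assumptions):

import org.apache.spark.sql.functions._

// Point 1: structure, data types, field names
df.printSchema()

// Point 2: count nulls, empty strings, and literal "None" per column (assumed sentinels)
df.columns.foreach { c =>
  val bad = df.filter(col(c).isNull || col(c) === "" || col(c) === "None").count()
  println(s"$c: $bad suspicious values")
}

// Point 7: total rows excluding the header (the header is consumed at read time)
println(s"rows: ${df.count()}")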

II. Reading the Data

  Here we use Spark to read the data.

  • Implicit conversions
import spark.implicits._
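  The import above hangs off a live SparkSession named spark; a minimal sketch of building one (the appName and master here are illustrative placeholders, not from the original):

import org.apache.spark.sql.SparkSession

// Illustrative session setup; adjust master/appName to your environment
val spark = SparkSession.builder()
  .appName("InterestsDataExplore")
  .master("local[*]")
  .getOrCreate()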
  • Read the users table
val df = spark.read.format("csv")
      .option("header", "true")
      .load("hdfs://192.168.59.200:9000/events/data/users.csv")
      .cache()
// df.count() = 38209
  • Read the friends table
val fr = spark.read.format("csv")
      .option("header", "true")
      .load("hdfs://192.168.59.200:9000/events/data/user_friends.csv")
      .cache()
// fr.count() = 38202
  • Read the events table
val eve = spark.read.format("csv")
      .option("header", "true")
      .load("hdfs://192.168.59.200:9000/events/data/events.csv")
      .cache()
// eve.count() = 3137972
  • Read the event-attendees table
val ea = spark.read.format("csv")
      .option("header", "true")
      .load("hdfs://192.168.59.200:9000/events/data/event_attendees.csv")
      .cache()
// ea.count() = 24144
  • Read the train table
val tr = spark.read.format("csv")
      .option("header", "true")
      .load("hdfs://192.168.59.130:9000/events/data/train.csv")
      .cache()
// tr.count() = 15398
  • Read the test table
val te = spark.read.format("csv")
      .option("header", "true")
      .load("hdfs://192.168.59.130:9000/events/data/test.csv")
      .cache()
// te.count() = 10237
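
  The commented counts can be reproduced in one pass (a sketch, assuming the variable names above):

Seq("users" -> df, "user_friends" -> fr, "events" -> eve,
    "event_attendees" -> ea, "train" -> tr, "test" -> te)
  .foreach { case (name, t) => println(s"$name: ${t.count()} rows") }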

III. Data Exploration

  1. Find the users whose locale is empty, join them to the friends table, and determine the most common locale among their friends.

//First find the users whose locale is null, then explode their friend lists
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val tmp = df.filter($"locale".isNull).withColumnRenamed("user_id","user")
	.join(fr, Seq("user"), "inner").select("user","friends")
	.withColumn("friendid",explode(split($"friends"," ")))
	.drop("friends")

val wnd = Window.partitionBy("user").orderBy(desc("peoplenum"))

//Count each locale among the friends and keep the most frequent one per user
val tmp2 = tmp.join(df,$"friendid"===$"user_id","inner")
	.select("user","locale").groupBy($"user",$"locale".alias("loc"))
	.agg(count($"locale").alias("peoplenum"))
	.select($"user",$"loc",row_number().over(wnd).alias("rank"))
	.filter("rank==1").cache()

//Fill each null locale with the dominant locale of the user's friends
df.join(tmp2,$"user"===$"user_id","left")
	.withColumn("localename",coalesce($"locale",$"loc"))
	.drop("locale","loc","user","rank")
	.show()

  2. Drop the None values in birthyear, filter down to plausible years, and count them.

df.filter("birthyear rlike '19[5-9][0-9]|20[0-1][0-9]|2020'")
	.agg(count($"birthyear")).show
	
//也可以用cast(IntegerType)
df.withColumn("birthyear",$"birthyear".cast(IntegerType)).agg(count($"birthyear"))
	
//向下取整取年龄的平均值
val pj = df.filter("birthyear rlike '19[5-9][0-9]|20[0-1][0-9]|2020'")
	.agg(floor(avg($"birthyear")).alias("pjyear"))

//UDF
val calAges = udf((age:String,defyear:String)=>{
	val reg = "^(19[5-9][0-9]|20[0-1][0-9]|2020)$".r
	if(reg.findAllMatchIn(age).hasNext) {
		age
	}else{
		defyear
	}
})

//把birthyear中的None替换成符合年份的平均值,这种替换是通过新增列来表达
df.crossJoin(pj).withColumn("byear",calAges($"birthyear",$"pjyear")).show()
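
  A crossJoin-free alternative (a sketch) is to null out invalid years and let na.fill supply the average; the anchored regex is the same as above:

// Collect the floor of the average valid year as a string
val avgYear = df.filter("birthyear rlike '^(19[5-9][0-9]|20[0-1][0-9]|2020)$'")
  .agg(floor(avg($"birthyear"))).first().getLong(0).toString

// when(...) without otherwise yields null, which na.fill then replaces
df.withColumn("birthyear",
    when($"birthyear".rlike("^(19[5-9][0-9]|20[0-1][0-9]|2020)$"), $"birthyear"))
  .na.fill(avgYear, Seq("birthyear"))
  .show()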

  3. Inspect the gender column.

df.groupBy("gender").agg(max("gender")).show()
//发现有null,则用unknown替换
df.withColumn("gender",
	when($"gender".isNull,"unknown").otherwise($"gender"))

//UDF函数
val genderUnique=udf((sex:String)=>{
	if(sex == null || sex.equals("None")){
		"unknown"
	}else if(sex.equals("男") || sex.equals("1")){
		"male"
	}else{
		"female"
	}
})

df.withColumn("gender",genderUnique($"gender"))

  4. Find the rows whose join time (joinedAt) is null.

df.filter($"joinedAt".isNull).count()

//regexp_extract的括号问题
df.withColumn("joinedAt",unix_timestamp(
      regexp_extract($"joinedAt","(.*T.*).[0-9]{3}Z.*",1),"yyyy-MM-ddTHH:mm:ss"))

//Drop the milliseconds and reformat the date as "yyyy-MM-dd HH:mm:ss"
val ymd = udf((str:String)=>{
	if(str == null || str.equals("None")){
		null
	}else{
		val tms = str.split("[T.]")
		tms(0)+" "+tms(1)
	}
})

//Convert the formatted date to a unix timestamp, average it, convert back,
//then substitute the average date for any malformed dates
val pjTime = df.withColumn("joinedAt",unix_timestamp(ymd($"joinedAt")))
  .agg(floor(avg($"joinedAt")).alias("pjtime"))
  .select(from_unixtime($"pjtime").alias("pjtime"))

df.crossJoin(pjTime)
	.withColumn("joinedAt",ymd($"joinedAt"))
	.withColumn("joinedAt",
		when($"joinedAt".isNull,$"pjtime")
		.otherwise($"joinedAt"))
	.drop("pjtime").show(60,false)

  5. Find duplicate events.

//Distinct event_ids via groupBy...
println(eve.groupBy($"event_id").agg(count($"event_id")).count())

//...and via countDistinct; compare both against the raw row count
eve.agg(countDistinct("event_id")).show()
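
  If the distinct count differs from the row count, the offending ids can be listed directly:

// Show event_ids that occur more than once, with their multiplicity
eve.groupBy($"event_id").count()
	.filter($"count" > 1)
	.orderBy(desc("count"))
	.show()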

  6. Find which host hosted the most sessions, and how many that is.

eve.groupBy($"user_id")
.agg(count($"user_id").alias("changci"))
.orderBy(desc("changci")).show()

  7. Count the user ids that appear in the events table but not in the users table.

println(eve.select("user_id").distinct()
      .join(df.select("user_id", "locale").distinct(), Seq("user_id"), "left")
      .filter($"locale".isNull).count())

  8. One row per user-friend pair (explode the friends list).

fr.createOrReplaceTempView("user_friends")
spark.sql(
  """
    |select user,friend
    |from user_friends
    |lateral view explode(split(friends,' ')) f as friend
    |""".stripMargin).show()

//Attach an index to each friend id
spark.sql(
  """
    |select user,friend,index
    |from user_friends
    |lateral view posexplode(split(friends,' ')) f as index,friend
    |""".stripMargin).show()

fr.select($"user".alias("user_id")
	,explode(split($"friends"," ")).alias("friend_id"))
	.drop("friends")

fr.withColumn("friends",explode(split($"friends"," "))).show()

  9. Split event_attendees into one row per (event, user, action).

//Spark version
val yes = ea.select($"event".alias("event_id")
  ,explode(split($"yes"," ")).alias("user_id")
  ,lit("yes").alias("act_type"))
val maybe = ea.select($"event".alias("event_id")
  ,explode(split($"maybe"," ")).alias("user_id")
  ,lit("maybe").alias("act_type"))
val invited = ea.select($"event".alias("event_id")
  ,explode(split($"invited"," ")).alias("user_id")
  ,lit("invited").alias("act_type"))
val no = ea.select($"event".alias("event_id")
  ,explode(split($"no"," ")).alias("user_id")
  ,lit("no").alias("act_type"))
//union is positional, so the consistent aliases above keep the schema clean
println(yes.union(maybe).union(invited).union(no).distinct().count())
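
  The four near-identical selects can also be folded into one loop over the action columns (a sketch; union is positional, so each branch keeps the same schema):

val actions = Seq("yes","maybe","invited","no")
val stacked = actions.map { c =>
    ea.select($"event".alias("event_id"),
      explode(split(col(c)," ")).alias("user_id"),
      lit(c).alias("act_type"))
  }.reduce(_ union _)
println(stacked.distinct().count())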
	
//Hive version
create database interests;
use interests;
create external table ext_attendees(
event string,
yes string,
maybe string,
invited string,
no string
)
row format delimited fields terminated by ','
location '/events/data/attendees'
tblproperties("skip.header.line.count"="1");

with
t1 as(select event,uid as user_id from ext_attendees
lateral view explode(split(yes,' ')) y as uid),
t2 as(select event,uid as user_id from ext_attendees
lateral view explode(split(maybe,' ')) m as uid),
t3 as(select event,uid as user_id from ext_attendees
lateral view explode(split(invited,' ')) i as uid),
t4 as(select event,uid as user_id from ext_attendees
lateral view explode(split(no,' ')) n as uid)
select event,user_id from t1
union select event,user_id from t2
union select event,user_id from t3
union select event,user_id from t4;


select count(*) from
(
select distinct event event_id,user_id,'yes' as action
from ext_attendees
lateral view explode(split(yes,' '))a as user_id
union all
select distinct event event_id,user_id,'maybe' as action
from ext_attendees
lateral view explode(split(maybe,' '))a as user_id
union all
select distinct event event_id,user_id,'invited' as action
from ext_attendees
lateral view explode(split(invited,' '))a as user_id
union all
select distinct event event_id,user_id,'no' as action
from ext_attendees
lateral view explode(split(no,' '))a as user_id
)f
where f.user_id is not null;

  10. Find each user's last event in the train table.

tr.select("user","event").distinct().count()

val wnd = Window.partitionBy($"user", $"event").orderBy(desc("timestamp"))

tr.withColumn("timestramp",invitedTime($"timestramp"))
	.select($"user",$"event",$"invited",$"timestamp",$"interested",$"not_interested"
	,row_number().over(wnd).alias("rank")).filter("rank==1")
	.agg(count("user")).show()

IV. POM File

  • Dependencies
<!-- spark-core -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.4</version>
</dependency>

<!-- hadoop-client -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.6.0</version>
</dependency>

<!-- spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.4</version>
</dependency>

<!-- guava -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>14.0.1</version>
</dependency>