[Big Data Learning - Lab 6] Spark Applications

1. Count How Many Lines Match a Condition

1. The file test.txt stores a number of user records, one record per line. Filter out the users whose gender is "male" (男) and count how many lines match.

18375,2011-5-20,2013-6-5,,4,广州,广东,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,,4,佛山,广东,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,,4,广州,广东,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,,4,广州,广东,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,,4,上海,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0

Implementation approach (a combined sketch in spark-shell follows this list):
(1) Read the data and create an RDD.
(2) Filter the data with the filter operation; the filter function checks whether a line contains the "male" marker, which can be done with the "contains" method.
(3) Use count on the result of step (2) to get the number of matching lines.
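
Putting the three steps together, a minimal end-to-end sketch looks like this (it assumes the HDFS path hdfs://localhost:9000/myspark/wordcount/test.txt set up below, with the gender field already replaced by "M"/"W" as described in the preparation step):

val rdd = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt")  // (1) read the data into an RDD
val males = rdd.filter(line => line.contains("M"))                         // (2) keep only lines containing the "male" marker
males.count()                                                              // (3) number of matching lines

Here males is just a name chosen for this sketch; the walkthrough below stores the same RDD in linesWithSpark.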

1) Preparation

Create the local directory /data/spark/wordcount:

mkdir -p /data/spark/wordcount

Create the file test.txt in that directory:

cd /data/spark/wordcount
vim test.txt

Write the following into test.txt. Because the system cannot handle Chinese characters, "男" (male) is replaced with "M" and "女" (female) with "W":

18375,2011-5-20,2013-6-5,W,4,GZ,GD,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,M,4,FS,GD,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,W,4,GZ,GD,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,W,4,GZ,GD,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,W,4,SH,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0

Start Hadoop:

cd /apps/hadoop
./sbin/start-all.sh

Start Spark:

/apps/spark/sbin/start-all.sh

Upload the local file to HDFS:

hadoop fs -put /data/spark/wordcount/test.txt /myspark/wordcount
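
If the /myspark/wordcount directory does not yet exist on HDFS, create it first with hadoop fs -mkdir -p /myspark/wordcount (this is the target path assumed throughout this lab).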

Start spark-shell:

spark-shell

2) Read the data and create an RDD

Read the data from HDFS into an RDD:

val rdd = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt");

Verify that the read succeeded by displaying the data that was read:

rdd.map(line => (line.split(',')(0), 1)).reduceByKey(_ + _).collect   // the fields are comma-separated, so split on ','
rdd.count()


3) Filter the data with the filter operation; the filter function checks whether a line contains the "male" marker, using the "contains" method

val linesWithSpark = rdd.filter(line => line.contains("M"))


4) Use count on the result of step (3) to get the number of matching lines

Count the lines containing the "male" marker:

linesWithSpark.count

Display the filtered data:

linesWithSpark.map(line => (line.split(',')(0), 1)).reduceByKey(_ + _).collect


2. Word Count in a Document

2. The data file words.txt contains multiple lines of sentences. Count the words in the document and store the words whose count exceeds 3 on HDFS.

WHat is going on there?
I talked to John on email.  We talked about some computer stuff that's it.

I went bike riding in the rain, it was not that cold.

We went to the museum in SF yesterday it was $3 to get in and they had
free food.  At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.

Implementation approach (a combined sketch in spark-shell follows this list):
(1) Read the data with the textFile method.
(2) Split each line into words with flatMap.
(3) Map each word to the form (word, 1) with map.
(4) Sum all the values of the same word with reduceByKey.
(5) Filter out the words whose count is greater than 3 with filter.
(6) Write the result to HDFS with saveAsTextFile.
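
Taken together, a minimal sketch of the whole pipeline looks like this (the input path matches the one used below; the output directory /myspark/wordcount/result and the intermediate names words, pairs, and frequent are just names chosen for this sketch, any HDFS output path that does not exist yet will do):

val textFile = sc.textFile("hdfs://localhost:9000/myspark/wordcount/words.txt")  // (1) read the data
val words = textFile.flatMap(line => line.split(" "))                             // (2) split each line into words
val pairs = words.map(word => (word, 1))                                          // (3) map each word to (word, 1)
val wordCounts = pairs.reduceByKey(_ + _)                                         // (4) sum the counts of the same word
val frequent = wordCounts.filter(_._2 > 3)                                        // (5) keep the words with count > 3
frequent.saveAsTextFile("hdfs://localhost:9000/myspark/wordcount/result")         // (6) write the result to HDFS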

1) Preparation

Create words.txt:

cd /data/spark/wordcount
vim words.txt

Write the following into words.txt:

WHat is going on there?
I talked to John on email.  We talked about some computer stuff that's it.

I went bike riding in the rain, it was not that cold.

We went to the museum in SF yesterday it was $3 to get in and they had
free food.  At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.

Upload the data to HDFS:

hadoop fs -put /data/spark/wordcount/words.txt  /myspark/wordcount 

Start spark-shell:

spark-shell

2) Read the data with the textFile method

val textFile = sc.textFile("hdfs://localhost:9000/myspark/wordcount/words.txt");


3) Split each line into words with flatMap, map each word to the form (word, 1) with map, and sum all the values of the same word with reduceByKey

val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)


4) Filter out the words whose count is greater than 3

val saveAsFile = wordCounts.filter(_._2 > 3)
saveAsFile.collect()

Keep saveAsFile as an RDD rather than assigning it the collected Array; collect() here is only used to inspect the filtered result, while the RDD itself is written to HDFS in the next step.


5) Write the result to HDFS with saveAsTextFile
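
A minimal sketch of this step, assuming the output directory /myspark/wordcount/result (any HDFS directory that does not exist yet will do):

saveAsFile.saveAsTextFile("hdfs://localhost:9000/myspark/wordcount/result")

The written files can then be inspected with hadoop fs -cat /myspark/wordcount/result/part-*.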


3. Count the Number of Goods Favorited by Each User

3. An e-commerce website has recorded a large amount of data on users favoriting goods, stored in a file named buyer_favorite1. The data format and content are as follows:

The fields are user ID (buyer_id), goods ID (goods_id), and favorite date (dt):

buyer_id  goods_id  dt
10181  1000481  2010-04-04 16:54:31
20001  1001597  2010-04-07 15:07:52
20001  1001560  2010-04-07 15:08:27
20042  1001368  2010-04-08 08:20:30
20067  1002061  2010-04-08 16:45:33

Use the Spark Scala API or the Spark Java API to perform a wordcount-style operation on the user favorite data and count the number of goods favorited by each user.
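
The counting follows the same pattern as the word count above: take the first field of each record (buyer_id) as the key, map it to a count of 1, and sum per key. A minimal sketch, assuming the HDFS path set up in the preparation below; for the five sample records the expected result is (10181,1), (20001,2), (20042,1), (20067,1), in some order:

val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite")
rdd.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _).collect   // (buyer_id, 1) per record, then sum per buyer_id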

1) Preparation

Create the file:

mkdir -p /data/spark3/wordcount
cd /data/spark3/wordcount
vim buyer_favorite

Press i to enter insert mode, type the following records, then press Esc and type :wq to save and quit:

10181  1000481  2010-04-04 16:54:31
20001  1001597  2010-04-07 15:07:52
20001  1001560  2010-04-07 15:08:27
20042  1001368  2010-04-08 08:20:30
20067  1002061  2010-04-08 16:45:33

Upload the file to HDFS:

hadoop fs -put /data/spark3/wordcount/buyer_favorite /myspark3/wordcount

Start spark-shell:

spark-shell


Read the favorite records into an RDD:

val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite")


Map each record to (buyer_id, 1) and sum the values with reduceByKey to get the number of goods favorited by each user:

rdd.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _).collect

