A First Look at Spark

This was my first time using Spark for big-data processing; here are my notes on the process.

The examples use Python.

Preparation

Install pip:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py

Install the pyspark module (here via the Tsinghua PyPI mirror):

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark

Write the wordcount.py program

from pyspark.sql import SparkSession

appName = "wordcount_cy"
spark = SparkSession.builder.master("yarn").appName(appName).getOrCreate()

# Read the text file from HDFS as an RDD of lines
text_file = spark.read.text("/caoyong/books/9527.txt").rdd.map(lambda r: r[0])

# Split each line into words, pair each word with 1, then sum the counts per word
counts = (text_file.flatMap(lambda x: x.split(' '))
                   .map(lambda x: (x, 1))
                   .reduceByKey(lambda x, y: x + y))

output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))

spark.stop()  # stop the SparkSession
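The flatMap / map / reduceByKey chain above is the classic MapReduce word count. Its effect can be sketched in plain Python, with no Spark needed (the sample lines below are made up for illustration):

```python
from collections import Counter

lines = ["to be or not", "to be"]
# flatMap: split every line on spaces and flatten into one list of words
words = [w for line in lines for w in line.split(' ')]
# map + reduceByKey: pair each word with 1, then sum the counts per key
counts = Counter(words)
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The difference in Spark is that the lines and the per-word pairs are partitioned across the cluster, and `reduceByKey` shuffles matching keys together before summing.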

Run wordcount.py

[datathink@server191 books]$ python3 wordcount.py
2021-06-09 12:00:01,121 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
The: 12
Project: 79
Gutenberg: 22
eBook: 7
of: 155
30: 12
Tempting: 2
Spaghetti: 46
Meals,: 1
by: 29
Campbell: 2
Soup: 7
Company: 2
: 2765
This: 3
is: 42
................
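Note the blank key counted 2,765 times in the output above: splitting on a single space character produces empty strings wherever the text has leading spaces or runs of whitespace. Adding a `.filter(lambda w: w)` step after the `flatMap` would drop them; the idea in plain Python (sample words made up):

```python
words = ["The", "", "", "Project", "", "Gutenberg"]
# Empty strings are falsy, so filtering on truthiness drops them,
# analogous to rdd.filter(lambda w: w) in the Spark pipeline.
filtered = [w for w in words if w]
print(filtered)  # ['The', 'Project', 'Gutenberg']
```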

While running, you may hit the following error:

pyspark fails with java.io.IOException: Cannot run program "python": CreateProcess error=2. The fix:

You need to set an environment variable:

vim .bashrc
export PYSPARK_PYTHON=/usr/bin/python3
source .bashrc
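Instead of editing .bashrc, the same variable can also be set at the top of the script itself, as long as it happens before the SparkSession is created (the path below assumes /usr/bin/python3 is the interpreter location on every node):

```python
import os

# Must run before SparkSession.builder...getOrCreate(), so that the
# worker processes Spark launches pick up this interpreter path.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
```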

Interacting with Hive

This part is in Java. The original snippet assumed an already-built `session`; the class wrapper, imports, and SparkSession setup (with `enableHiveSupport()`, which is required for the SQL below to reach the Hive metastore) are filled in here:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveDemo {

    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("hive_demo")
                .enableHiveSupport()
                .getOrCreate();

        session.sql("drop table if exists students_info");
        session.sql("create table if not exists students_info(name string, age int) "
                + "row format delimited fields terminated by '\t'");

        // Load data into the student info table
        session.sql(
                "load data local inpath '/opt/module/spark-test/data/student_infos.txt' into table default.students_info");

        session.sql("drop table if exists students_score");
        session.sql("create table if not exists students_score(name string, score int) "
                + "row format delimited fields terminated by '\t'");

        // Load data into the student score table
        session.sql(
                "load data local inpath '/opt/module/spark-test/data/student_scores.txt' into table default.students_score");

        // Query: join the two tables and keep students scoring above 80
        Dataset<Row> dataset = session.sql(
                "select s1.name,s1.age,s2.score from students_info s1 join students_score s2 on s1.name=s2.name where s2.score>80");

        // Save the dataset back into Hive
        session.sql("drop table if exists students_result");
        dataset.write().saveAsTable("students_result");

        // Read the Hive table back as a dataset to verify the save
        Dataset<Row> table = session.table("students_result");
        table.show();

        session.stop();
    }
}

Spark API: https://blog.csdn.net/sdut406/article/details/103445486
Example reference: https://www.yuque.com/easyexcel/doc/write
