My first time using Spark for big data processing; recording the process here.
Using Python.
Preparation
Install pip:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
Install the pyspark module:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark
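After the pip install, a quick sanity check (not part of the original post) confirms that the module is importable before moving on:

```python
import importlib.util

def module_available(name):
    """Return True if the named module can be imported."""
    return importlib.util.find_spec(name) is not None

# check pyspark after installing it with pip
if module_available("pyspark"):
    import pyspark
    print("pyspark version:", pyspark.__version__)
else:
    print("pyspark is not installed; rerun the pip command above")
```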
Write the wordcount.py program
from pyspark.sql import SparkSession

appName = "wordcount_cy"
spark = SparkSession.builder.master("yarn").appName(appName).getOrCreate()
text_file = spark.read.text("/caoyong/books/9527.txt").rdd.map(lambda r: r[0])
counts = text_file.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
output = counts.collect()
for (word, count) in output:
    print("%s: %i" % (word, count))
spark.stop()  # stop the SparkSession
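The flatMap/map/reduceByKey chain above is the classic word-count pattern. Its effect can be sketched in plain Python (no Spark needed, sample lines made up here) to see what each step contributes:

```python
lines = ["the quick brown fox", "the lazy dog", "the fox"]

# flatMap: split each line into words and flatten into one list
words = [w for line in lines for w in line.split(' ')]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In Spark the same three steps run distributed across partitions, with reduceByKey shuffling pairs so that all counts for a given word land on the same node.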
Run wordcount.py
[datathink@server191 books]$ python3 wordcount.py
2021-06-09 12:00:01,121 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
The: 12
Project: 79
Gutenberg: 22
eBook: 7
of: 155
30: 12
Tempting: 2
Spaghetti: 46
Meals,: 1
by: 29
Campbell: 2
Soup: 7
Company: 2
: 2765
This: 3
is: 42
................
While running, the following error may occur:
pyspark fails with java.io.IOException: Cannot run program "python": CreateProcess error=2. The fix is to set an environment variable:
vim .bashrc
export PYSPARK_PYTHON=/usr/bin/python3
source .bashrc
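As an alternative to editing .bashrc (an option not covered in the original post), the same variable can be set from inside the script itself, before the SparkSession is created:

```python
import os
import sys

# Point Spark's Python workers at the interpreter running this script.
# This must happen before the SparkSession / SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
print(os.environ["PYSPARK_PYTHON"])
```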
Interacting with Hive (this example is in Java)
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkHiveExample {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("spark-hive")
                .enableHiveSupport()
                .getOrCreate();
        session.sql("drop table if exists students_info");
        session.sql("create table if not exists students_info(name string, age int) "
                + "row format delimited fields terminated by '\t'");
        // load the data into the student info table
        session.sql(
                "load data local inpath '/opt/module/spark-test/data/student_infos.txt' into table default.students_info");
        session.sql("drop table if exists students_score");
        session.sql("create table if not exists students_score(name string, score int) "
                + "row format delimited fields terminated by '\t'");
        // load the data into the student score table
        session.sql(
                "load data local inpath '/opt/module/spark-test/data/student_scores.txt' into table default.students_score");
        // query: students with a score above 80
        Dataset<Row> dataset = session.sql(
                "select s1.name, s1.age, s2.score from students_info s1 join students_score s2 on s1.name = s2.name where s2.score > 80");
        // save the dataset back into Hive
        session.sql("drop table if exists students_result");
        dataset.write().saveAsTable("students_result");
        // read the Hive table back as a dataset to verify the save succeeded
        Dataset<Row> table = session.table("students_result");
        table.show();
        session.stop();
    }
}
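The SQL in the example joins the two tables on name and keeps only rows with score > 80. The same join-and-filter logic, sketched in plain Python over small in-memory lists (the sample names and scores here are made up, purely to illustrate what the query computes):

```python
students_info = [("zhangsan", 18), ("lisi", 19), ("wangwu", 20)]
students_score = [("zhangsan", 90), ("lisi", 75), ("wangwu", 85)]

# join on name, then filter on score > 80 -- mirrors the SQL in the example
scores = dict(students_score)
result = [(name, age, scores[name])
          for name, age in students_info
          if name in scores and scores[name] > 80]

print(result)  # [('zhangsan', 18, 90), ('wangwu', 20, 85)]
```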
Spark API: https://blog.csdn.net/sdut406/article/details/103445486
Example reference: https://www.yuque.com/easyexcel/doc/write