Quickly set up a PySpark development environment on Windows so that code is easy to write and debug.
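1. Download and install the JDK
2. Download and install Anaconda3
3. Download Hadoop
4. Download winutils.exe and place it in the hadoop\bin directory
5. Run pip install -U -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark to install pyspark and its dependency py4j
6. Test the environment in PyCharm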
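Before running the examples, the setup steps above can be sanity-checked with a short script. This is a minimal sketch, not part of the original setup: the function name check_pyspark_env is hypothetical, and it assumes that JAVA_HOME and HADOOP_HOME are set as environment variables (the examples in this post set them inside the script instead, in which case the check should run after those assignments).

```python
import importlib.util
import os
from pathlib import Path


def check_pyspark_env(env=None):
    """Return a list of problems found; an empty list means every check passed.

    `env` is any mapping of environment variables (defaults to os.environ).
    This helper is hypothetical and only illustrates the setup steps.
    """
    env = os.environ if env is None else env
    problems = []

    # Step 1: the JDK location must be known to Spark.
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not Path(java_home).is_dir():
        problems.append(f"JAVA_HOME is not a directory: {java_home}")

    # Steps 3 and 4: Hadoop directory with winutils.exe under bin.
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not (Path(hadoop_home) / "bin" / "winutils.exe").is_file():
        problems.append(f"winutils.exe not found under: {hadoop_home}")

    # Step 5: pyspark and py4j must be importable.
    for module in ("pyspark", "py4j"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python package not installed: {module}")

    return problems


if __name__ == "__main__":
    for line in check_pyspark_env() or ["All checks passed"]:
        print(line)
```

An empty result only means the paths and packages exist; it does not verify that the JDK version is compatible with the installed Spark version.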
PySpark example 1:
import os
# Raw strings keep the backslashes in Windows paths from being read as escape sequences
os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_191"
os.environ['HADOOP_HOME'] = r"E:\software\hadoop"
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("PythonWordCount")
sc = SparkContext(conf = conf)
links = sc.parallelize(["A","B","C","D"])
C = links.flatMap(lambda dest:(dest,1)).count()
D = links.map(lambda dest:(dest,1)).count()
print(C)
print(D)
c = links.flatMap(lambda dest:(dest,1)).collect()
d = links.map(lambda dest:(dest,1)).collect()
print(c)
print(d)
Output:
8
4
['A', 1, 'B', 1, 'C', 1, 'D', 1]
[('A', 1), ('B', 1), ('C', 1), ('D', 1)]
Process finished with exit code 0
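The difference between the two counts comes from how flatMap and map treat the tuple returned by the lambda: map keeps each tuple as one element, while flatMap iterates over each tuple and emits its items individually. The sketch below reproduces that behavior with plain Python lists instead of RDDs, so it runs without Spark; the helper name pair is an assumption introduced for illustration.

```python
from itertools import chain

links = ["A", "B", "C", "D"]


def pair(dest):
    """Same function as the lambda in the example: dest -> (dest, 1)."""
    return (dest, 1)


# map: one output element per input element, so each tuple stays intact.
mapped = [pair(x) for x in links]

# flatMap: each returned tuple is iterated and its items are concatenated.
flat_mapped = list(chain.from_iterable(pair(x) for x in links))

print(len(flat_mapped))  # 8
print(len(mapped))       # 4
print(flat_mapped)       # ['A', 1, 'B', 1, 'C', 1, 'D', 1]
print(mapped)            # [('A', 1), ('B', 1), ('C', 1), ('D', 1)]
```

This is why flatMap yields 8 elements (4 inputs times 2 items per tuple) and map yields 4.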
PySpark example 2:
import os
# Raw strings keep the backslashes in Windows paths from being read as escape sequences
os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_191"
os.environ['HADOOP_HOME'] = r"E:\software\hadoop"
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize the SparkSession
spark = SparkSession.builder.master("local[*]").appName("FiratApp1").getOrCreate()
# Either of the next two statements produces a DataFrame with ids 0 to 9
# data = spark.createDataFrame(map(lambda x: (x,), range(10)), ["id"])
data = spark.range(0, 10).select(col("id").cast("double"))
# Sum the id column
data.agg({'id': 'sum'}).show()
# Stop the session
spark.stop()
Output:
+-------+
|sum(id)|
+-------+
|   45.0|
+-------+
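The result 45.0 can be cross-checked without Spark. The sketch below is plain Python, not PySpark: it mirrors the range 0 to 9 cast to double and compares the sum with the closed form n(n-1)/2 for the integers 0 to n-1. The variable names are assumptions introduced for illustration.

```python
# Mirrors spark.range(0, 10) with the id column cast to double.
ids = [float(i) for i in range(0, 10)]

# Equivalent of the 'sum' aggregate over the id column.
total = sum(ids)

# Closed form for 0 + 1 + ... + (n - 1).
n = len(ids)
closed_form = n * (n - 1) / 2

print(total)        # 45.0
print(closed_form)  # 45.0
```

The value is printed as 45.0 rather than 45 because the ids are floating-point numbers, which matches the cast to double in the Spark example.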