Preliminaries
1. Create a directory on HDFS
[root@hadoop101 /]# hadoop fs -mkdir /spark
2. Upload the stu.txt file
[root@hadoop101 temdata]# hadoop fs -put stu.txt /spark
3. Verify the upload (the more command below shows the local copy of stu.txt)
[root@hadoop101 temdata]# hadoop fs -ls /spark
Found 1 items
-rw-r--r-- 3 root supergroup 68 2020-06-21 20:52 /spark/stu.txt
[root@hadoop101 temdata]# more stu.txt
1 zhang
2 wang
3 li
4 zhao
5 chen
6 liu
7 huang
8 yang
9 bai
Start the pyspark shell (from Spark's bin directory)
[root@hadoop101 spark-2.4.4-hadoop2.7]# cd bin/
[root@hadoop101 bin]# ./pyspark
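In the pyspark shell, sc (a SparkContext) is already defined, so the examples below use it directly. For reference, a minimal standalone-script equivalent would look roughly like this (a sketch; the app name stu-demo is made up):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("stu-demo")  # hypothetical app name
sc = SparkContext(conf=conf)               # the shell creates this sc for you
# ... run the RDD operations shown below ...
sc.stop()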
1. Read the file and view its first record
>>> lines = sc.textFile("/spark/stu.txt")
>>> print(lines.first())
1 zhang
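As a quick sanity check, count() is an action that reads the whole file and returns the number of records (a sketch; the file uploaded above has 9 lines):

>>> lines.count()
9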
View the first two records:
>>> for line in lines.take(2):
...     print(line)
...
1 zhang
2 wang
Note: the body of the for loop must be indented (put spaces before print), otherwise Python raises an IndentationError.
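For reference, take(n) itself returns the first n elements to the driver as a Python list, and collect() returns all of them, so the loop above is optional (a sketch on the same RDD; collect() pulls the entire dataset into driver memory, which is fine only for small data like this):

>>> lines.take(2)
['1 zhang', '2 wang']
>>> lines.collect()
['1 zhang', '2 wang', '3 li', '4 zhao', '5 chen', '6 liu', '7 huang', '8 yang', '9 bai']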
>>> nums = sc.parallelize([1, 2, 3, 4])
>>> squared = nums.map(lambda x: x * x).collect()
>>> for num in squared:
...     print(num)
...
1
4
9
16
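map() is a lazy transformation: Spark only records it, and no job runs until an action such as collect() is called. A small sketch that makes this visible (the reduce here sums the squares; it is an added illustration, not part of the original session):

>>> squared = nums.map(lambda x: x * x)   # transformation: no job runs yet
>>> squared.reduce(lambda a, b: a + b)    # action: triggers the computation
30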
>>> lines = sc.parallelize(["hello world", "hi"])
>>> print(lines.first())
hello world
>>> words = lines.flatMap(lambda line: line.split(" "))
>>> print(words.first())
hello
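The difference from map() is that flatMap() flattens the per-line lists into a single RDD of words, while map() would keep one list per line (a sketch on the same data):

>>> lines.map(lambda line: line.split(" ")).collect()
[['hello', 'world'], ['hi']]
>>> lines.flatMap(lambda line: line.split(" ")).collect()
['hello', 'world', 'hi']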
Deduplication
>>> diswords = words.distinct()
>>> for word in diswords.take(3):
... print(word)
...
world
hi
hello
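Note that the words hello, world, and hi contain no duplicates, so distinct() only changes their order here (it shuffles data across partitions, which is why the output order differs from the input). A sketch with an actual duplicate shows the effect:

>>> dup = sc.parallelize(["hello", "hi", "hello"])
>>> sorted(dup.distinct().collect())
['hello', 'hi']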