Python Skills in Practice, Part 7 -- Using Spark in Jupyter Notebook to read a local file and implement a simple word count
1. If you are using Jupyter on a company machine and don't know the current working directory, you can check it with the code below. It prints the absolute path.
import os
os.getcwd()
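As a quick sanity check, `os.getcwd()` always returns an absolute path, and `os.listdir()` shows what is in that directory, which is handy for confirming your data file is where you expect (the filename below is just an example, adjust it to yours):

```python
import os

cwd = os.getcwd()
print(cwd)                 # absolute path of the notebook's working directory
print(os.path.isabs(cwd))  # True -- getcwd() is always absolute

# Check whether an expected data file is present (example filename)
print("text.csv" in os.listdir(cwd))
```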
2. Read a file from the current working directory with Spark. For a non-remote file, prefix the path with file:// to indicate the local filesystem.
from pyspark import SparkContext
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StringType, StructField, StructType
import json
import csv
spark = SparkSession.builder \
    .appName("test") \
    .enableHiveSupport() \
    .getOrCreate()
sc = spark.sparkContext
csv_file = "file:///jupyter_notebook/test/text.csv"
# csv_file = "file:///your_path/text.csv"
movieRdd = sc.textFile(csv_file)
# Parse each line with csv.reader so that quoted fields containing commas are split correctly.
# Note: this assumes the file has no header row; if it does, filter the header out first.
movieCsv = movieRdd.map(lambda line: Row(*next(csv.reader([line]))))
schemaString = ("budget,genres,homepage,id,keywords,original_language,original_title,"
                "overview,popularity,production_companies,production_countries,"
                "release_date,revenue,runtime,spoken_languages,status,tagline,title,"
                "vote_average,vote_count")
mdf = movieCsv.toDF(schemaString.split(','))
mdf.createOrReplaceTempView('movie')
genresDF = spark.sql('select genres from movie')
genresDF.show()
# Each genres cell is a JSON array like [{"id": 28, "name": "Action"}, ...];
# flatten it to (name, 1) pairs, then sum the counts per genre name.
wordItem = genresDF.rdd.flatMap(lambda row: json.loads(row['genres'])).map(lambda x: (x['name'], 1))
wordCount = wordItem.reduceByKey(lambda x, y: x + y)
wordCount.take(10)
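The Spark pipeline above is just a flatMap into (genre, 1) pairs followed by a reduceByKey sum. A plain-Python equivalent of the same logic, run on a made-up two-row sample in the same JSON-array format as the genres column, looks like this:

```python
import json
from collections import Counter

# Two made-up rows in the same format as the movie file's genres column
sample_genres = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 28, "name": "Action"}, {"id": 35, "name": "Comedy"}]',
]

# flatMap: parse each JSON array and emit one name per genre entry
names = [g["name"] for cell in sample_genres for g in json.loads(cell)]

# reduceByKey(lambda x, y: x + y) on (name, 1) pairs is just a per-key sum
counts = Counter(names)
print(counts)  # Counter({'Action': 2, 'Adventure': 1, 'Comedy': 1})
```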
PS: The data is a dataset of 5,000 movies from Douban.
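An aside on the per-line parsing trick used above: wrapping a single line in a list and calling csv.reader on it lets quoted fields containing commas survive, where a naive str.split(",") would break them apart. With an illustrative line (made-up values):

```python
import csv

# A single CSV line where the quoted middle field contains commas
line = '19995,"[{""id"": 28, ""name"": ""Action""}]",Avatar'

fields = next(csv.reader([line]))
print(fields)      # ['19995', '[{"id": 28, "name": "Action"}]', 'Avatar']
print(len(fields)) # 3 -- naive line.split(",") would produce more pieces
```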