Launching the shell
pyspark
IPYTHON=1 pyspark
IPYTHON_OPTS="notebook" pyspark
(on Windows: run set IPYTHON=1, then pyspark)
Running a Python script
spark-submit my_script.py
Initializing a SparkContext
from pyspark import SparkConf,SparkContext
conf = SparkConf().setMaster("local").setAppName("Myapp")
sc = SparkContext(conf=conf)
Ch5: Reading CSV data
If no field contains an embedded newline (each line is one complete record)
import csv
import StringIO  # Python 2 module; on Python 3 use io.StringIO instead
...
def loadRecord(line):
    """Parse a CSV line"""
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favoriteAnimal"])
    return next(reader)  # the single row parsed from this line
input = sc.textFile(inputFile).map(loadRecord)
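To check what the per-line parser returns without starting Spark, the same logic can be run locally. This is a minimal sketch assuming Python 3 (so io.StringIO replaces the StringIO module); the name load_record and the sample line are made up for illustration.

```python
import csv
import io

FIELDNAMES = ["name", "favoriteAnimal"]

def load_record(line):
    """Parse one CSV line into a dict keyed by FIELDNAMES."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDNAMES)
    return next(reader)

# Hypothetical sample line: the quoted field contains a comma,
# which a plain line.split(",") would break apart.
record = load_record('holden,"panda, red"')
print(record["name"])
print(record["favoriteAnimal"])
```

Using the csv module rather than splitting on commas is what keeps the quoted field in one piece.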
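If fields contain embedded newlines
The whole file has to be loaded at once, so that a record spanning several lines is parsed intact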
def loadRecords(fileNameContents):
    """Load all the records in a given file"""
    # fileNameContents is a (filename, contents) pair from wholeTextFiles
    input = StringIO.StringIO(fileNameContents[1])
    reader = csv.DictReader(input, fieldnames=["name", "favoriteAnimal"])
    return reader
fullFileData = sc.wholeTextFiles(inputFile).flatMap(loadRecords)
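Why the whole file is needed can be seen locally, without Spark. A rough sketch assuming Python 3; the file name animals.csv, the sample contents, and the name load_records are hypothetical. Splitting on newlines (which is what reading line by line amounts to) cuts the second record in two, while parsing the full contents keeps it whole.

```python
import csv
import io

FIELDNAMES = ["name", "favoriteAnimal"]

# Hypothetical file contents: the second record has a newline
# inside a quoted field.
contents = 'holden,panda\nsparky,"bear\ncub"\n'

def load_records(file_name_contents):
    """Parse every record from one (filename, contents) pair."""
    _, text = file_name_contents
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDNAMES)
    return list(reader)

# Line-by-line view: three fragments for only two records.
fragments = contents.splitlines()

# Whole-file view: the csv module follows the quoted field across the newline.
records = load_records(("animals.csv", contents))

print(len(fragments), len(records))
print(repr(records[1]["favoriteAnimal"]))
```

The trade-off is that each file is read into memory as a single string, so this approach suits many small files better than a few very large ones.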