Spark offers three main ways of processing data: RDD, DataFrame, and Spark SQL.
The key difference between them is whether a schema is defined.
An RDD has no schema (no field names or data types are defined). Working with it requires a Map/Reduce mindset and a relatively high level of programming skill, but it is also the most powerful API and can do everything Spark can do.
A Spark DataFrame must have a schema defined when it is created (a name and data type for every field).
Spark SQL is derived from DataFrames: you first build a DataFrame, register it as a Spark SQL temp table, and then you can query it with Spark SQL syntax.
Ease of use: Spark SQL > DataFrame > RDD
Spark SQL reference: https://www.jianshu.com/p/3a991fe7fd84
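As a quick illustration of the ease-of-use ordering above, here is a minimal sketch (assuming Spark 2.x with a SparkSession named spark, and a hypothetical people.csv file with name,age columns) that does the same count with each of the three APIs:
# RDD: no schema; you handle raw strings and key/value pairs yourself
rdd_counts = spark.sparkContext.textFile("people.csv") \
    .map(lambda line: line.split(",")) \
    .map(lambda p: (p[1], 1)) \
    .reduceByKey(lambda a, b: a + b)
print(rdd_counts.collect())
# DataFrame: the schema is known, so columns can be referenced by name
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.groupBy("age").count().show()
# Spark SQL: register the DataFrame as a temp view, then query it with SQL
df.createOrReplaceTempView("people")
spark.sql("select age, count(*) as counts from people group by age").show()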
Example: processing the Zhilian Zhaopin recruitment data from the Alibaba Tianchi competition:
RDD1 = sc.textFile("/bigdata")                  # read the raw text file into an RDD
RDD1.count()                                    # number of lines
RDD2 = RDD1.map(lambda line: line.split(","))   # split each CSV line into a list of fields
# 1. Create a DataFrame from RDD2: define each field name and data type via Row
from pyspark.sql import Row
zhilian_Rows = RDD2.map(lambda p: Row(
    num_people=p[0],
    Company_name=p[1],
    Job_name=p[8],
    work_place=p[9],
    Experience=p[14],
))
2. After zhilian_Rows has been created, use the sqlContext.createDataFrame() method to build a DataFrame from it
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
zhilian_df = sqlContext.createDataFrame(zhilian_Rows)  # build the DataFrame with sqlContext.createDataFrame
zhilian_df.printSchema()     # inspect the DataFrame's schema
zhilian_df.show(5)           # show the first 5 rows
df = zhilian_df.alias("df")  # create an alias so the DataFrame can be referenced as df
df.select("Company_Type").groupby("Company_Type").count().show()  # count postings per company type with the DataFrame API
3. Query with PySpark SQL
sqlContext.registerDataFrameAsTable(df, "zhilian_table")  # register df as a temp table named zhilian_table
sqlContext.sql("select count(*) counts from zhilian_table").show()  # query the table via sqlContext.sql
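Once the temp table is registered, any other SQL query works the same way; for example, a sketch grouping by the work_place field defined in the Row above:
sqlContext.sql(
    "select work_place, count(*) counts from zhilian_table group by work_place order by counts desc"
).show()  # postings per work place, most frequent first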
Other code:
# Use the DataFrame-based pyspark.ml API (the RDD-based pyspark.mllib Word2Vec is not needed here)
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel, Tokenizer, RegexTokenizer, StopWordsRemover
# 1. Pull the data out of the database
org_df = spark.sql("select label,username from XXX ")
# 2. Convert the extracted data into (label, text) rows for a DataFrame
res_rdd = org_df.rdd.map(list).map(lambda x: [x[0], ' '.join([i for i in x[1]])]).map(tuple)  # split each username into space-separated characters
#print(res_rdd.take(100))
res_df = spark.createDataFrame(res_rdd,['label','username'])
# 3. Tokenize with Tokenizer
tokenizer = Tokenizer(inputCol="username", outputCol="words")
t_words = tokenizer.transform(res_df)
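The Word2Vec imported above is not used in this snippet; the following is a minimal sketch (vectorSize and minCount are illustrative values) of feeding the tokenized t_words into the DataFrame-based Word2Vec:
word2vec = Word2Vec(vectorSize=100, minCount=1, inputCol="words", outputCol="features")
w2v_model = word2vec.fit(t_words)        # learn embeddings from the tokenized usernames
res_vec = w2v_model.transform(t_words)   # append an averaged document vector per row
res_vec.select("label", "features").show(5, truncate=False)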
2. Feature processing
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
data = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('train.csv')
drop_list = ['Dates', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']  # drop_list was undefined above; for the SF-Crime train.csv these are the columns to discard, keeping only Category and Descript
data = data.select([column for column in data.columns if column not in drop_list])
data.show(5)
data.printSchema()  # show the schema
from pyspark.sql.functions import col
data.groupBy("descrip").count().orderBy(col("count").desc()).show() #包含犯罪数量最多的20个描述
# Build a Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words", pattern="\\W")# regular expression tokenizer
add_stopwords = ["http","https","amp","rt","t","c","the"] # stop words
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
countVectors = CountVectorizer(inputCol="filtered", outputCol="features",
                               vocabSize=10000, minDF=5)  # bag-of-words counts
label_stringIdx = StringIndexer(inputCol = "Category", outputCol = "label")
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors,
                            label_stringIdx])
pipelineFit = pipeline.fit(data) # Fit the pipeline to training documents.
dataset = pipelineFit.transform(data)
dataset.show(5)
3. Text-classification models (see https://cloud.tencent.com/developer/article/1096712)
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed = 100)
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("Descript", "Category", "probability", "label", "prediction") \
    .orderBy("probability", ascending=False) \
    .show(n=10, truncate=30)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)
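evaluate() above returns the evaluator's default metric (F1); a sketch of requesting a specific metric explicitly (metricName='accuracy' is available in newer Spark versions, older ones offer e.g. 'weightedPrecision'):
acc_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="accuracy")
print("test accuracy:", acc_evaluator.evaluate(predictions))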
Other references:
https://www.jianshu.com/p/680be5650e68 (data preprocessing / feature engineering with PySpark)