Spark SQL and Feature Processing

Spark offers three main ways to process data: RDD, DataFrame, and Spark SQL.

The key difference among the three is whether a schema is defined.

An RDD has no schema (no field names or data types are defined). Using it requires thinking in Map/Reduce terms and a higher level of programming skill, but it is also the most powerful: everything Spark can do can be done with RDDs.

A Spark DataFrame must define a schema when it is created (a name and data type for every field).

Spark SQL is derived from the DataFrame: we first build a DataFrame, register it as a Spark SQL temp table, and can then use Spark SQL syntax against it.

Ease of use: Spark SQL > DataFrame > RDD
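
As a minimal sketch of the three styles (the sample data and variable names here are illustrative assumptions, not part of the example below), the same count-by-field task can be written as follows:

# Minimal sketch contrasting RDD, DataFrame, and Spark SQL; sample data is illustrative
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

rows = [("Alice", "Beijing"), ("Bob", "Shanghai"), ("Carol", "Beijing")]

# 1. RDD: no schema, fields accessed by position, Map/Reduce style
rdd = sc.parallelize(rows)
print(rdd.map(lambda r: (r[1], 1)).reduceByKey(lambda a, b: a + b).collect())

# 2. DataFrame: field names (types inferred here) are defined at creation time
df = sqlContext.createDataFrame(rows, ["name", "city"])
df.groupBy("city").count().show()

# 3. Spark SQL: register the DataFrame as a temp table, then query it with SQL
sqlContext.registerDataFrameAsTable(df, "people")
sqlContext.sql("select city, count(*) as counts from people group by city").show()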

Reference (Spark SQL): https://www.jianshu.com/p/3a991fe7fd84
Example: processing the Ali Tianchi / Zhilian recruitment data:

RDD1 = sc.textFile("/bigdata")   # load the raw text file
RDD1.count()   # number of lines
RDD2 = RDD1.map(lambda line: line.split(","))   # split each line into fields
# Build a DataFrame from RDD2, defining a name for each field
from pyspark.sql import Row
zhilian_Rows = RDD2.map(lambda p:
    Row(
        num_people=p[0],
        Company_name=p[1],
        Job_name=p[8],
        work_place=p[9],
        Experience=p[14],
    )
)

2. After creating zhilian_Rows, write the data into a DataFrame with the sqlContext.createDataFrame() method:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
zhilian_df = sqlContext.createDataFrame(zhilian_Rows)   # build the DataFrame with sqlContext.createDataFrame
zhilian_df.printSchema()   # inspect the DataFrame's schema
zhilian_df.show(5)   # show the first 5 rows
df = zhilian_df.alias("df")   # create an alias for the DataFrame
df.select("Company_name").groupBy("Company_name").count().show()   # count postings per company with the DataFrame API

3. Create a PySpark SQL table
sqlContext.registerDataFrameAsTable(df, "zhilian_table")   # register df as a temp table named zhilian_table
sqlContext.sql("select count(*) counts from zhilian_table").show()   # query it with sqlContext.sql

Other code:

from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer, CountVectorizerModel, Tokenizer, RegexTokenizer, StopWordsRemover

#1. Pull the data from the database
org_df = spark.sql("select label, username from XXX")

#2. Convert the extracted data into a DataFrame; each username is split into characters joined by spaces
res_rdd = org_df.rdd.map(list).map(lambda x: [x[0], ' '.join([i for i in x[1]])]).map(tuple)
#print(res_rdd.take(100))
res_df = spark.createDataFrame(res_rdd, ['label', 'username'])

#3. Tokenize with Tokenizer
tokenizer = Tokenizer(inputCol="username", outputCol="words")
t_words = tokenizer.transform(res_df)
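
To sanity-check the tokenization, a few rows of the new words column can be inspected (a small usage sketch assuming the DataFrame above):

# Each username is now an array of single-character tokens
t_words.select("username", "words").show(5, truncate=False)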

2 Feature Processing

from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
data = sqlContext.read.format('csv').options(header='true', inferschema='true').load('train.csv')
drop_list = []   # columns to drop from the raw data; fill in as needed
data = data.select([column for column in data.columns if column not in drop_list])
data.show(5)
data.printSchema()   # print the schema

from pyspark.sql.functions import col
data.groupBy("Descript").count().orderBy(col("count").desc()).show()   # the 20 most frequent crime descriptions

# Use a Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words", pattern="\\W")   # regular expression tokenizer
add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the"]   # stop words
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
countVectors = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=10000, minDF=5)   # bag-of-words counts
label_stringIdx = StringIndexer(inputCol="Category", outputCol="label")   # encode the string Category column as a numeric label

pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_stringIdx])
pipelineFit = pipeline.fit(data)   # fit the pipeline to the training documents
dataset = pipelineFit.transform(data)
dataset.show(5)

3 Text classification with various models: https://cloud.tencent.com/developer/article/1096712

(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(testData)
predictions.filter(predictions['prediction'] == 0) \
    .select("Descript", "Category", "probability", "label", "prediction") \
    .orderBy("probability", ascending=False) \
    .show(n=10, truncate=30)

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
evaluator.evaluate(predictions)   # default metric is F1
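
The linked article goes on to compare several other models; as an illustrative sketch (not taken from the original), a Naive Bayes classifier can be trained on the same bag-of-words features and scored with the same evaluator:

# Naive Bayes baseline on the features/label columns produced by the pipeline
from pyspark.ml.classification import NaiveBayes

nb = NaiveBayes(smoothing=1.0)
nbModel = nb.fit(trainingData)
nbPredictions = nbModel.transform(testData)
evaluator.evaluate(nbPredictions)   # compare against the logistic regression score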

Other references

https://www.jianshu.com/p/680be5650e68 (data preprocessing / featurization with PySpark)
