PySpark Project in Practice

Version: pyspark = 3.2.0

Mode: local

This article uses the San Francisco crime dataset, which contains 9 columns: Dates, Category, Descript, DayOfWeek, PdDistrict, Resolution, Address, X, and Y.

Project task: predict Category from the Descript column, i.e. a multi-class text classification problem.

Imports:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

The full script:

if __name__ == '__main__':
    spark = SparkSession.builder \
        .appName('test') \
        .master('local[*]') \
        .getOrCreate()

    data = spark.read.format('csv').option('header', True) \
        .option('encoding', 'utf-8').load('./data/train.csv')
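    # note (assumption): without inferSchema every column loads as a string; add
    # .option('inferSchema', True) if typed columns (e.g. X/Y as doubles) are needed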
    
    data = data.select(['Category', 'Descript']) 

    data.printSchema()
    data.show(5)

    data.groupBy('Category').count().orderBy('count', ascending=False).show()
    data.groupBy('Descript').count().orderBy('count', ascending=False).show()
    # data.groupBy('Category').count().orderBy(F.col('count').desc()).show()
    # data.groupBy('Descript').count().orderBy(F.col('count').desc()).show()

    # regex tokenization: split Descript on non-word characters
    regex = RegexTokenizer(inputCol='Descript', outputCol='words', pattern='\\W')
    stopword = ['http', 'https', 'amp', 'rt', 't', 'c', 'the']  # words to remove
    # drop the words listed in stopword
    stopwords = StopWordsRemover(inputCol='words', outputCol='filtered', stopWords=stopword)
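    # aside (a sketch, not in the original post): Spark also ships built-in stop
    # word lists that could be merged with the custom list above, e.g.:
    # english_stops = StopWordsRemover.loadDefaultStopWords('english')
    # stopwords = StopWordsRemover(inputCol='words', outputCol='filtered',
    #                              stopWords=stopword + english_stops)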

    # turn the word lists into term-frequency vectors
    CountVectors = CountVectorizer(inputCol='filtered', outputCol='features', vocabSize=10000, minDF=5)
    # encode the Category text labels as numeric indices
    label_stringidx = StringIndexer(inputCol='Category', outputCol='label')
    # pipeline
    pipeline = Pipeline(stages=[regex, stopwords, CountVectors, label_stringidx])
    pipelinefit = pipeline.fit(data)
    dataset = pipelinefit.transform(data)
    dataset.show(5)
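    # optional sanity check (assumes the stage order above): stages[2] is the fitted
    # CountVectorizerModel, so its learned vocabulary can be inspected directly
    print(pipelinefit.stages[2].vocabulary[:20])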
    # train/test split
    (trainingdata, testingdata) = dataset.randomSplit([0.7, 0.3], seed=100)
    print('train count: ' + str(trainingdata.count()))
    print('test count: ' + str(testingdata.count()))
    # logistic regression classifier (elasticNetParam=0 means pure L2 regularization)
    lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
    lrmodel = lr.fit(trainingdata)
    result = lrmodel.transform(testingdata)
    result.show(10)  # show 10 rows here
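    # optional: show just the human-readable columns of the same result DataFrame
    result.select('Descript', 'Category', 'probability', 'prediction').show(10, truncate=30)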
    
    # accuracy (the evaluator defaults to F1, so request accuracy explicitly)
    evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', metricName='accuracy')
    accuracy = evaluator.evaluate(result)
    print(accuracy)
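
To map the numeric predictions back to the original category names, the labels learned by the fitted StringIndexer (stage 3 of the pipeline above) can be handed to an IndexToString transformer. A minimal sketch, assuming the script above has already run:

from pyspark.ml.feature import IndexToString

# stage 3 of the fitted pipeline is the StringIndexerModel; its labels map each
# numeric index back to the original Category string
idx2str = IndexToString(inputCol='prediction', outputCol='predictedCategory',
                        labels=pipelinefit.stages[3].labels)
idx2str.transform(result).select('Descript', 'Category', 'predictedCategory').show(10)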

If hyperparameter tuning is needed:

# run this after constructing the logistic regression lr, reusing the evaluator above
ParamGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.3, 0.5]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.1, 0.2]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=ParamGrid, evaluator=evaluator)
cvmodel = cv.fit(trainingdata)
result1 = cvmodel.transform(testingdata)
result1.show()
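
After cross-validation, the best hyperparameters and the mean metric for each grid point can be read off the fitted CrossValidatorModel. A minimal sketch, assuming the tuning code above:

# bestModel is the LogisticRegressionModel refit on the whole training set
best = cvmodel.bestModel
print('best regParam:', best.getRegParam())
print('best elasticNetParam:', best.getElasticNetParam())
# avgMetrics holds the mean cross-validated metric for each point in ParamGrid
print(cvmodel.avgMetrics)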
