Using --archives in PySpark to upload BERT files for prediction

Latest recommended article published on 2023-11-06 01:03:55

算法驯化师

Category: Data Analysis. Tags: bert, spark, big data

Article link: https://blog.csdn.net/lov1993/article/details/133152089

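When a PySpark job has to run BERT prediction, the Spark code is launched from a shell script. The key point is that, because of the size of the BERT model, broadcast is not used; instead, a map operation runs the model on every executor. The shell command uploads the BERT files with the --archives parameter, and they are unpacked into a my_bert directory so that the model is available on every node, which makes efficient prediction possible.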

Submitting a job in PySpark

At work, once the Spark code is written, it is usually launched from a shell script such as the following:

spark-3.0/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --queue queue1 \
    --driver-memory 8g \
    --executor-memory 8g \
    --executor-cores 5 \
    --num-executors 10 \
    --conf spark.network.timeout=7200 \
    --conf spark.executor.heartbeatInterval=3600 \
    --conf spark.sql.shuffle.partitions=1500 \
    --conf spark.default.parallelism=1000 \
    --conf spark.executor.memoryOverhead=2048M \
    --archives viewfs:hdfs/tf_env.zip,hdfs/my_bert.zip#my_bert \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./tf_env.zip/tf/bin/python \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./tf_env.zip/tf/bin/python \
    ./spark_predict_bert.py
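
In the --archives option, the part after # is an alias: YARN unpacks my_bert.zip into a directory called my_bert in each container's working directory. An archive without an alias is unpacked under its file name, which is why the Python path starts with ./tf_env.zip/. The prediction script reads ./my_bert/my_bert/vocab.txt, so the zip presumably contains a top-level my_bert folder of its own. The sketch below shows one way such a zip could be built. The function name zip_model_dir and the file names in the demonstration are hypothetical, and this is an assumption about how the archive was packaged, not the author's actual procedure.

```python
import os
import tempfile
import zipfile


def zip_model_dir(model_dir, zip_path):
    """Zip model_dir so that its own folder name is the top-level zip entry."""
    model_dir = os.path.abspath(model_dir)
    parent = os.path.dirname(model_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(model_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the parent folder, e.g. my_bert/vocab.txt
                zf.write(full, os.path.relpath(full, parent))
    # Re-open the archive and report the entry names that were written.
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())


if __name__ == "__main__":
    # Hypothetical local layout, created only for this demonstration.
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = os.path.join(tmp, "my_bert")
        os.makedirs(model_dir)
        for fname in ("vocab.txt", "config.json"):
            open(os.path.join(model_dir, fname), "w").close()
        print(zip_model_dir(model_dir, os.path.join(tmp, "my_bert.zip")))
```

With a zip laid out like this and the alias #my_bert, the files end up under ./my_bert/my_bert/ inside every container, which matches the paths used in the prediction code.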

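With this approach, note that the BERT code has to run on the executors: a map operation loads BERT on each node and predicts there, and the model cannot be shipped with broadcast. The code that reads the model looks like this: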

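from transformers import BertTokenizer, TFBertForSequenceClassification

# Load once from the directory that --archives unpacked under the alias my_bert.
tokenizer = BertTokenizer('./my_bert/my_bert/vocab.txt')
model = TFBertForSequenceClassification.from_pretrained("./my_bert/my_bert")

# collect() brings every row to the driver, so this loop runs in the driver process.
text_list = df_res.select("content").rdd.flatMap(lambda x: x).collect()
for i in text_list:
    # encode_examples and predict are helper functions that are not shown in this post.
    bert_input = encode_examples(i, tokenizer)
    note_class = predict(model, bert_input)
    print(f"the note is : {i}, and the predict result is: {note_class} !!!")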
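
The snippet above calls collect() first, so its loop runs in the driver. To run prediction on the executors through a map operation, as the text describes, one option is rdd.mapPartitions with a model that is loaded once per partition. The following is a minimal sketch under stated assumptions, not the author's implementation: make_partition_predictor and load_bert are hypothetical names, and encode_examples and predict are the post's own helpers, whose definitions are not shown.

```python
def make_partition_predictor(load_model):
    """Build a function that can be passed to rdd.mapPartitions.

    load_model is called at most once per partition and must return a
    callable that maps one text to its predicted class.
    """
    def predict_partition(texts):
        classify = None
        for text in texts:
            if classify is None:
                # Lazy loading: empty partitions never pay the model load cost.
                classify = load_model()
            yield text, classify(text)
    return predict_partition


def load_bert():
    # Hypothetical loader. The import is done here so that it happens on the
    # executor, using the Python environment shipped via --archives.
    from transformers import BertTokenizer, TFBertForSequenceClassification
    tokenizer = BertTokenizer("./my_bert/my_bert/vocab.txt")
    model = TFBertForSequenceClassification.from_pretrained("./my_bert/my_bert")
    # encode_examples and predict are the post's helpers (assumed to be defined
    # in spark_predict_bert.py); they are only looked up when this is called.
    return lambda text: predict(model, encode_examples(text, tokenizer))


# Assumed wiring on the cluster, with df_res as in the post:
# results = (df_res.select("content").rdd
#            .map(lambda row: row[0])
#            .mapPartitions(make_partition_predictor(load_bert)))
```

Loading per partition rather than per row keeps the cost of from_pretrained proportional to the number of partitions, and each executor reads the model from its local copy of the unpacked archive instead of receiving it from the driver.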