The documentation for spaCy 2.0 mentions that the developers have added functionality to allow spaCy to be pickled so that it can be used by a Spark cluster through PySpark. However, it doesn't give instructions on how to do this.
Can someone explain how I can pickle spaCy's English-language named-entity (NE) parser so that I can use it inside my UDFs?
This doesn't work:
from pyspark import cloudpickle
from spacy.lang.en import English

nlp = English()
pickled_nlp = cloudpickle.dumps(nlp)
Solution
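Not really an answer, but the best workaround I've discovered is to load the model lazily inside the UDF, so that each worker loads its own copy instead of receiving a pickled one: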
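from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy

def get_entities_udf():
    def get_entities(text):
        # Load the model on first use in each worker instead of pickling it.
        global nlp
        try:
            doc = nlp(str(text))  # on Python 2, use unicode(text)
        except NameError:
            # nlp is not defined yet in this worker process
            nlp = spacy.load('en')
            doc = nlp(str(text))
        return [t.label_ for t in doc.ents]
    # The function returns a list of strings, hence ArrayType(StringType()).
    res_udf = udf(get_entities, ArrayType(StringType()))
    return res_udf

documents_df = documents_df.withColumn('entities', get_entities_udf()('text'))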
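
The same lazy-loading idea can also be written without `global` and without relying on a `NameError`. The following is only a minimal sketch: `get_model`, `make_entity_labeler` and `_MODEL_CACHE` are hypothetical names I made up, not part of spaCy or PySpark, and it assumes the helpers live in a module that is importable on the workers so that the cache survives between calls within a worker process.

```python
# Hypothetical helper names; nothing here is part of spaCy or PySpark.
_MODEL_CACHE = {}


def get_model(name, loader):
    """Return the model cached under `name`, calling `loader(name)` on first use only."""
    if name not in _MODEL_CACHE:
        _MODEL_CACHE[name] = loader(name)
    return _MODEL_CACHE[name]


def make_entity_labeler(name, loader):
    """Build a plain function that maps text to a list of entity labels."""
    def entity_labels(text):
        if text is None:  # null column values arrive as None
            return []
        nlp = get_model(name, loader)
        return [ent.label_ for ent in nlp(text).ents]
    return entity_labels
```

With spaCy installed on the workers, this could presumably be wrapped as `udf(make_entity_labeler('en', spacy.load), ArrayType(StringType()))`. Passing the loader in as an argument also makes it possible to try the function locally with a stand-in model before running it on the cluster.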