1. Put HanLP's data directory (dictionaries and models) on HDFS, then set the root path in the project's hanlp.properties file, for example:
root=hdfs://localhost:9000/tmp/
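For reference, the data directory can be uploaded with the standard HDFS shell (the local data/ path below is an assumption; point it at wherever the HanLP release was unpacked):

hdfs dfs -mkdir -p /tmp
hdfs dfs -put data hdfs://localhost:9000/tmp/

With this layout, HanLP resolves resources as root + "data/...", so root should point at the directory that contains data/, not at data/ itself.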
2. Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.hankcs.hanlp.corpus.io.IIOAdapter;

// Routes all of HanLP's file I/O through the Hadoop FileSystem API,
// so dictionaries and models can be read from and written to HDFS.
public static class HadoopFileIoAdapter implements IIOAdapter {
    @Override
    public InputStream open(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.open(new Path(path));
    }

    @Override
    public OutputStream create(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.create(new Path(path));
    }
}
3. Set the IOAdapter and create the segmenter:
private static Segment segment;

static {
    // The adapter must be assigned before HanLP loads any resources,
    // so that dictionary/model lookups go through HDFS.
    HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
    segment = new CRFSegment();
}
Then segment can be used inside Spark operations to do the actual segmentation, as in the sketch below.
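A minimal sketch of that usage, assuming Spark 2.x's Java API; the app name and the input/output paths are hypothetical. Because segment is a static field, each executor JVM re-runs the static initializer when the class loads, so nothing needs to be serialized to the workers:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("hanlp-hdfs-demo");
JavaSparkContext sc = new JavaSparkContext(conf);

// Segment each line and flatten the resulting terms into words.
JavaRDD<String> words = sc.textFile("hdfs://localhost:9000/tmp/input.txt")
        .flatMap(line -> segment.seg(line).stream()
                .map(term -> term.word)
                .iterator());
words.saveAsTextFile("hdfs://localhost:9000/tmp/output");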