spaCy v3.0: Training, Evaluating, and Packaging a Text Classification Model, with Data Preprocessing

Cxrlyy

Article link: https://blog.csdn.net/u014607067/article/details/115294589


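1 GPU usage issues

Local GPU: NVIDIA GeForce GT 740M, compute capability 3.5. The newest PyTorch version usable with this card is 1.2, while spaCy's transformer models require PyTorch 1.5 at minimum, so despite repeated attempts the GPU could not be used.
(For GPUs with compute capability 5.2 or higher, using the GPU requires downloading the CUDA Toolkit; at the time of writing the installer was "cuda_11.1.0_456.43_win10.exe".)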
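The version mismatch described above can be checked before attempting an install. The following is a minimal sketch, not the author's procedure: the helper names meets_minimum and spacy_gpu_available are hypothetical, the version numbers are the ones quoted in this section, and the GPU check assumes spaCy may or may not be installed.

```python
def parse_version(version: str) -> tuple:
    """Turn a version string such as "1.5.0" into a (major, minor) tuple."""
    return tuple(int(part) for part in version.split(".")[:2])


def meets_minimum(installed: str, required: str) -> bool:
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def spacy_gpu_available() -> bool:
    """Ask spaCy whether it can activate a GPU; False if spaCy is missing."""
    try:
        import spacy  # imported lazily so the check degrades gracefully
    except ImportError:
        return False
    # prefer_gpu() returns True when a GPU was activated, False otherwise
    return bool(spacy.prefer_gpu())


if __name__ == "__main__":
    # Versions quoted in this section: PyTorch 1.2 available, 1.5 required
    print("PyTorch 1.2 satisfies >= 1.5:", meets_minimum("1.2", "1.5"))
    print("spaCy can use a GPU:", spacy_gpu_available())
```

When the second check prints False, training still works on the CPU; only the transformer-based configurations become impractical.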

2 Using the spacy command

2.1 Usage:

spacy [OPTIONS] COMMAND [ARGS]...

2.2 Commands and descriptions:

No	Commands	Description
1.	convert	Convert files into json or DocBin format for training.
2.	debug	Suite of helpful commands for debugging and profiling.
3.	download	Download compatible trained pipeline from the default download…
4.	evaluate	Evaluate a trained pipeline.
5.	info	Print info about spaCy installation.
6.	init	Commands for initializing configs and pipeline packages.
7.	package	Generate an installable Python package for a pipeline.
8.	pretrain	Pre-train the ‘token-to-vector’ (tok2vec) layer of pipeline…
9.	project	Command-line interface for spaCy projects and templates.
10.	train	Train or update a spaCy pipeline.
11.	validate	Validate the currently installed pipeline packages and spaCy…
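Because every subcommand follows the same spacy [OPTIONS] COMMAND [ARGS]... shape, the calls can also be scripted. This is a minimal sketch that assumes spaCy is installed in the current interpreter's environment; the helper names spacy_command and run_spacy are hypothetical and not part of spaCy.

```python
import subprocess
import sys


def spacy_command(command, *args):
    """Build the argument list for `python -m spacy COMMAND [ARGS]...`."""
    return [sys.executable, "-m", "spacy", command, *args]


def run_spacy(command, *args):
    """Run a spaCy CLI command and return its exit code."""
    return subprocess.run(spacy_command(command, *args)).returncode


if __name__ == "__main__":
    # Show the command line that would be executed for `spacy info`
    print(" ".join(spacy_command("info")))
```

Calling the CLI through sys.executable with -m spacy keeps the command tied to the same Python environment as the script.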

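3 Starting from a cloned project

Text classification (textcat) is used as the example.

3.1 Download the project from GitHub

Create a directory, then run the following command in it from the command line: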

python -m spacy project clone https://hub.fastgit.org/explosion/projects/tree/v3/tutorials/textcat_goemotions

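After the download completes, the project contains two directories and one file:
scripts
configs
project.yml

3.2 Substituting your own data

assets directory:
categories.txt  category labels, one per line
train.csv  training data
dev.csv  validation data
test.csv  test data
corpus directory: running spacy project run preprocess generates the following files (it calls convert_corpus.py in the scripts directory)
train.spacy
dev.spacy
test.spacy

Differences from the original project:
1 The original project's data is tab-separated; this project's data is comma-separated.
2 The original project's data includes annotator information; this project's data does not.
3 In the original project a sample can carry several labels; in this project each sample has exactly one.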
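To make the expected file layout concrete, here is a minimal sketch of reading such a two-column, comma-separated file with Python's standard csv module. The sample rows and the function name read_records are hypothetical; the real data files are not shown in this article. Using csv.reader is a swapped-in choice that also copes with quoted text containing commas.

```python
import csv
import io


def read_records(file_):
    """Yield {"text", "label"} dicts from a two-column, comma-separated file."""
    for row in csv.reader(file_):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
        text, label = row
        yield {"text": text.strip(), "label": label.strip()}


# Hypothetical sample in the layout described above: text,label (one label per row)
SAMPLE = 'first example text,A\n"text with, a comma",B\n'
records = list(read_records(io.StringIO(SAMPLE)))
print(records)
```

Rows whose text contains a comma must be quoted in the CSV file for this to work.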

Modified convert_corpus.py:

from pathlib import Path
import typer
from spacy.tokens import DocBin
import spacy


# Defaults resolve relative to this script's own directory; pass data_dir and
# corpus_dir explicitly if assets/ and corpus/ live elsewhere.
ASSETS_DIR = Path(__file__).parent / "assets"
CORPUS_DIR = Path(__file__).parent / "corpus"


def read_categories(path: Path):
    """Read the category labels, one per line."""
    return path.read_text(encoding="utf8").strip().splitlines()


def read_csv(file_):
    """Yield one {"text", "label"} record per non-empty line."""
    for line in file_:
        line = line.strip()
        if not line:
            continue
        # Split on the last comma only, so commas inside the text do not break parsing
        text, label = line.rsplit(",", 1)
        yield {"text": text, "label": label}


def convert_record(nlp, record, categories):
    """Convert a record from the csv into a spaCy Doc object."""
    doc = nlp.make_doc(record["text"])
    # All categories other than the true one get value 0
    doc.cats = {category: 0 for category in categories}
    # The true label gets value 1
    doc.cats[record["label"]] = 1
    return doc


def main(data_dir: Path = ASSETS_DIR, corpus_dir: Path = CORPUS_DIR, lang: str = "zh"):
    """Convert the csv files in data_dir to spaCy's binary format."""
    categories = read_categories(data_dir / "categories.txt")
    nlp = spacy.blank(lang)
    for csv_file in data_dir.iterdir():
        if csv_file.suffix != ".csv":
            continue
        with csv_file.open(encoding="utf8") as file_:
            docs = [convert_record(nlp, record, categories) for record in read_csv(file_)]
        # Write train.csv -> train.spacy, dev.csv -> dev.spacy, and so on
        out_file = corpus_dir / csv_file.with_suffix(".spacy").name
        DocBin(docs=docs).to_disk(out_file)


if __name__ == "__main__":
    typer.run(main)
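One thing the conversion does not guard against is a label in the CSV that is missing from the category list: assigning it into doc.cats would silently add a new category. The sketch below shows the one-hot dictionary built per record together with such a check. The function name one_hot_cats is hypothetical, and the label set is taken from the model output shown later in this article.

```python
# Label set taken from the prediction output shown in section 5.3
CATEGORIES = ["A", "B", "C", "E", "M", "O", "R", "S"]


def one_hot_cats(label, categories):
    """Build the cats dict for a single-label sample: 1 for the true label, 0 elsewhere."""
    if label not in categories:
        raise ValueError(f"label {label!r} is not listed in the categories file")
    return {category: int(category == label) for category in categories}


cats = one_hot_cats("C", CATEGORIES)
print(cats)
```

Running a check like this over train.csv, dev.csv, and test.csv before preprocessing catches typos in labels early.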

3.3 Train the model

spacy project run train

When training finishes, a training directory is created containing the model-best and model-last directories.

3.4 Evaluate the model

spacy project run evaluate

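3.5 Package the model

spacy project run package

This creates a packages directory. Its dist subdirectory contains the trained model package as an XXXX.tar.gz file, which can be installed with pip install and then used in the same way as a standard pretrained pipeline.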

4 Chinese word segmentation

The downloaded project targets English; for Chinese text it is best to use a Chinese tokenizer.

[nlp]
lang = "zh"
pipeline = ["tok2vec","textcat"]

[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "pkuseg"

[initialize]
vectors = "zh_core_web_lg"

[initialize.tokenizer]
pkuseg_model = "mixed"
pkuseg_user_dict = "user_dict.txt"

The user-defined dictionary can also be placed in the data directory, but the path must then be adjusted to match.
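String values in the config need plain ASCII double quotes; typographic quotes picked up when copying from a web page will not be read as string delimiters. Below is a rough standard-library check, assuming literal values are JSON-style as in the excerpt above. The function name find_non_json_values is hypothetical, and the check would also flag spaCy variable references such as ${paths.train}, so treat hits as hints rather than errors.

```python
import configparser
import json

# Excerpt of the tokenizer settings from this section
CONFIG_TEXT = """
[nlp]
lang = "zh"
pipeline = ["tok2vec","textcat"]

[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "pkuseg"
"""


def find_non_json_values(config_text):
    """Return (section, key, value) triples whose value is not valid JSON."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as @tokenizers unchanged
    parser.read_string(config_text)
    suspicious = []
    for section in parser.sections():
        for key, value in parser[section].items():
            try:
                json.loads(value)
            except json.JSONDecodeError:
                suspicious.append((section, key, value))
    return suspicious


print(find_non_json_values(CONFIG_TEXT))
```

For a definitive check, spaCy's own debug commands listed in section 2.2 validate the full config.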

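4.1 The "mixed" model

It is best to download pkuseg's "mixed" model in advance and extract it to the C:\Users\Administrator\.pkuseg directory.

4.2 The user dictionary triggers the following error during packaging: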

File "c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\spacy_pkuseg\__init__.py", line 58, in __init__
     assert isinstance(w_t, tuple)
AssertionError

Edit the .py file named in the traceback:

After line 52, add:

w_t = tuple(w_t)

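w_t is a list with values such as ['密封风', ''], while the code requires a tuple, hence the conversion.

What causes the type mismatch is not yet clear. The workaround lets packaging go through so the model can be validated; the root cause remains to be investigated.

5 Load and test the model

5.1 On the command line, change the current directory to /packages/dist

5.2 Install the model with pip:

pip install zh_textcat_aux-0.0.1.tar.gz

5.3 Test the model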

import spacy
nlp = spacy.load('zh_textcat_aux')
texts = ['变频装置操作原则','变频装置送电启动前检查项目','凝泵变频器检修转热备用']
docs = nlp.pipe(texts)
for doc in docs:
	print(doc.text)
	print(doc.cats)

Output:

变频装置操作原则
{'A': 1.8430232273658476e-07, 'B': 2.4967513923002116e-07, 'C': 0.9736177921295166, 'E': 1.4260081115935463e-06, 'M': 0.0003109535900875926, 'O': 0.0260681863874197, 'R': 1.220762669618125e-06, 'S': 3.233772005728497e-08}
变频装置送电启动前检查项目
{'A': 2.9710254256798407e-09, 'B': 0.9999549388885498, 'C': 1.8397947769699385e-06, 'E': 1.1763377472107095e-07, 'M': 2.796637090796139e-07, 'O': 2.1979019493301166e-07, 'R': 4.259566048858687e-05, 'S': 1.0315843231717414e-12}
凝泵变频器检修转热备用
{'A': 0.000247561139985919, 'B': 0.00014120333071332425, 'C': 0.00045776416664011776, 'E': 0.0004549270961433649, 'M': 0.010306362062692642, 'O': 0.9880962371826172, 'R': 0.00028140778886154294, 'S': 1.4502625163004268e-05}

Labels predicted by the model for the three sentences: 'C', 'B', 'O'
Expected labels: 'C', 'B', 'O'
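The predicted label is simply the category with the highest score in doc.cats. A small sketch using the three score dictionaries printed above; the helper name top_label is hypothetical:

```python
def top_label(cats):
    """Return the category with the highest score."""
    return max(cats, key=cats.get)


# Scores copied from the output above, one dict per test sentence
outputs = [
    {'A': 1.8430232273658476e-07, 'B': 2.4967513923002116e-07, 'C': 0.9736177921295166, 'E': 1.4260081115935463e-06, 'M': 0.0003109535900875926, 'O': 0.0260681863874197, 'R': 1.220762669618125e-06, 'S': 3.233772005728497e-08},
    {'A': 2.9710254256798407e-09, 'B': 0.9999549388885498, 'C': 1.8397947769699385e-06, 'E': 1.1763377472107095e-07, 'M': 2.796637090796139e-07, 'O': 2.1979019493301166e-07, 'R': 4.259566048858687e-05, 'S': 1.0315843231717414e-12},
    {'A': 0.000247561139985919, 'B': 0.00014120333071332425, 'C': 0.00045776416664011776, 'E': 0.0004549270961433649, 'M': 0.010306362062692642, 'O': 0.9880962371826172, 'R': 0.00028140778886154294, 'S': 1.4502625163004268e-05},
]

predicted = [top_label(cats) for cats in outputs]
print(predicted)  # ['C', 'B', 'O']
```

With a loaded pipeline the same helper can be applied directly to doc.cats for each processed document.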

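Conclusion:

On the same training and test data, spaCy v3.0 clearly outperformed spaCy v2.3.0.
In spaCy v3.0, data preprocessing and model training, evaluation, and packaging are all driven by configuration files, which makes the structure and workflow clearer.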
