Five baseline text classification models: running songyingxin/TextClassification without errors

德彪稳坐倒骑驴

Last modified: 2023-03-20 22:32:45

First published: 2023-03-15 15:06:53

Article link: https://blog.csdn.net/Albert233333/article/details/129550629

Introduction

Code: https://github.com/songyingxin/TextClassification

A post explaining the design ideas behind the models: "Several text classification models that can serve as baselines", https://zhuanlan.zhihu.com/p/64603089

word level

TextCNN

TextRNN

LSTMATT

TextRCNN

TransformerText

char-level + word-level

TextCNNHighway

TextRNNHighway

LSTMATTHighway

TextRCNNHighway

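What this code actually does is sentiment analysis: given a review, classify it as "positive" or "negative".

Setting up the environment

The code was released on April 30, 2019, so the packages you install should be versions released at least one to two months before that date. If a package is too new, you will run into all sorts of strange errors.

python 3.6.1

The author only says the Python version is 3.6, without giving the patch version. After some trial and error, I found 3.6.1 to be the best fit.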

conda create --name py361tc100 python=3.6.1

However, if you install 3.6.0, you get the error below.

It is explained here: https://stackoverflow.com/questions/60011092/importerror-cannot-import-name-deque

Traceback (most recent call last):
  File "run_SST.py", line 17, in <module>
    from Utils.utils import word_tokenize, get_device, epoch_time, classifiction_metric
  File "/media/F:/FILES_OF_ALBERT/IT_paid_class/IT培训班/贪心-小牛实习/Code/TextClassification/Utils/utils.py", line 1, in <module>
    import spacy
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/spacy/__init__.py", line 6, in <module>
    from .errors import setup_default_warnings
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/spacy/errors.py", line 2, in <module>
    from .compat import Literal
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/spacy/compat.py", line 3, in <module>
    from thinc.util import copy_array
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/thinc/__init__.py", line 5, in <module>
    from .config import registry
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/thinc/config.py", line 2, in <module>
    import confection
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/confection/__init__.py", line 10, in <module>
    from pydantic import BaseModel, create_model, ValidationError, Extra
  File "pydantic/__init__.py", line 2, in init pydantic.__init__
  File "pydantic/dataclasses.py", line 4, in init pydantic.dataclasses
    import types
  File "pydantic/error_wrappers.py", line 4, in init pydantic.error_wrappers
  File "pydantic/json.py", line 12, in init pydantic.json
  File "pydantic/types.py", line 28, in init pydantic.types
    yield
  File "pydantic/validators.py", line 9, in init pydantic.validators
ImportError: cannot import name Deque
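
Before installing anything else, it can help to confirm the exact patch version of the interpreter and whether its typing module provides the name the traceback complains about. The following is a minimal sketch using only the standard library; the helper name has_typing_names is made up for illustration.

```python
import sys
import typing


def has_typing_names(names):
    """Return a dict mapping each requested name to whether typing provides it."""
    return {name: hasattr(typing, name) for name in names}


if __name__ == "__main__":
    # Print the exact interpreter version, including the patch number.
    print("python", ".".join(str(part) for part in sys.version_info[:3]))
    # Deque is the name the traceback above fails to import.
    print(has_typing_names(["Deque", "NoReturn"]))
```

If a name shows up as False, the packages that import it will fail in the same way as in the traceback.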

pytorch == 1.0

For pytorch 1.0, the official builds only target the old CUDA versions 10.0, 9.0, and 8.0, but my CUDA is the newer 10.1.

# CUDA 10.0
conda install pytorch==1.0.0 torchvision==0.2.1 cuda100 -c pytorch
# CUDA 9.0
conda install pytorch==1.0.0 torchvision==0.2.1 cuda90 -c pytorch
# CUDA 8.0
conda install pytorch==1.0.0 torchvision==0.2.1 cuda80 -c pytorch

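These are the three official install commands. I tried all of them, and none could install pytorch: conda reported that no package matching this version was available from my channels (PackagesNotFoundError). The failed install printed the following.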

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - pytorch==1.0.0

Current channels:

  - https://conda.anaconda.org/pytorch/linux-64
  - https://conda.anaconda.org/pytorch/noarch
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/linux-64
  - http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/noarch
  - https://mirrors.bfsu.edu.cn/anaconda/pkgs/free/linux-64
  - https://mirrors.bfsu.edu.cn/anaconda/pkgs/free/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

I did not want to spend time installing a second CUDA alongside the existing one, so I had to settle for the CPU build of torch 1.0.0.

conda install pytorch-cpu==1.0.0 torchvision-cpu==0.2.1 cpuonly -c pytorch

pytorch is now installed, but running "import torch" raises the error below.

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torch/__init__.py", line 84, in <module>
    from torch._C import *
ImportError: /home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torch/lib/libmkldnn.so.0: undefined symbol: cblas_sgemm_alloc

Running the following command in the shell fixes this error.

(ref: https://blog.csdn.net/Lstar_/article/details/118658610)

conda install -c anaconda mkl
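
After each fix like this one, a quick import check shows which packages still fail and with what message. The sketch below relies only on the standard importlib module; the helper name try_import and the list of module names are illustrative choices, not part of the project.

```python
import importlib


def try_import(module_name):
    """Try to import a module; return (True, "") or (False, error text)."""
    try:
        importlib.import_module(module_name)
        return True, ""
    except Exception as exc:  # ImportError, OSError from a missing shared library, etc.
        return False, "{}: {}".format(type(exc).__name__, exc)


if __name__ == "__main__":
    for name in ["torch", "torchtext", "spacy"]:
        ok, message = try_import(name)
        print(name, "OK" if ok else "FAILED " + message)
```

Catching the broad Exception is deliberate here: the failures in this post surface as ImportError in some cases and OSError in others.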

The author does not give version numbers for the packages below; these are the ones I found by trial and error.

torchtext == 0.6.0

pip install torchtext==0.6.0

If you install torchtext 0.8.0, which is too new, you get the error below. This post (https://blog.csdn.net/YHR14/article/details/108472276) explains that the cause is a torchtext version that is too high and therefore incompatible.

Traceback (most recent call last):
  File "run_SST.py", line 13, in <module>
    from torchtext import data
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/torchtext/__init__.py", line 40, in <module>
    _init_extension()
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/torchtext/__init__.py", line 36, in _init_extension
    torch.ops.load_library(ext_specs.origin)
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/site-packages/torch/_ops.py", line 102, in load_library
    ctypes.CDLL(path)
  File "/home/albert/anaconda3/envs/py360tc100/lib/python3.6/ctypes/__init__.py", line 344, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libtorch.so: cannot open shared object file: No such file or directory
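
The "too new" check can be written down as a small comparison between what is installed and the versions this post settles on. This is only a sketch: the function names are made up, the installed versions are example values typed in by hand, and the parsing handles purely numeric dotted versions only.

```python
def version_tuple(version):
    """Turn a dotted version such as '0.6.0' into a tuple of ints (0, 6, 0)."""
    return tuple(int(part) for part in version.split("."))


def newer_than_pin(installed, pins):
    """Return the packages whose installed version is newer than the pinned one."""
    return sorted(
        name
        for name, pinned in pins.items()
        if name in installed and version_tuple(installed[name]) > version_tuple(pinned)
    )


if __name__ == "__main__":
    pins = {"torchtext": "0.6.0", "spacy": "2.1.0"}       # versions named in this post
    installed = {"torchtext": "0.8.0", "spacy": "2.1.0"}  # example values
    print(newer_than_pin(installed, pins))
```

Comparing tuples of integers rather than strings matters once a component reaches two digits, for example 0.10.0 against 0.9.1.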

spacy==2.1.0

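I chose this version because this nine-model text classification project was released on April 30, 2019, and the spacy release current before that date was 2.1.0, published on March 18, 2019.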

pip install spacy==2.1.0

However, if you install the latest spacy (3.5.1) with the command below

pip install spacy

you get the following error

Traceback (most recent call last):
  File "train_eval.py", line 7, in <module>
    from Utils.utils import classifiction_metric
  File "/media/F:/FILES_OF_ALBERT/IT_paid_class/IT_training/greedyAI_intern/Code/TextClassification/Utils/utils.py", line 1, in <module>
    import spacy
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/spacy/__init__.py", line 6, in <module>
    from .errors import setup_default_warnings
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/spacy/errors.py", line 2, in <module>
    from .compat import Literal
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/spacy/compat.py", line 3, in <module>
    from thinc.util import copy_array
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/thinc/util.py", line 14, in <module>
    from contextvars import ContextVar
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/contextvars/__init__.py", line 4, in <module>
    import immutables
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/immutables/__init__.py", line 18, in <module>
    from ._protocols import MapKeys as MapKeys
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/immutables/_protocols.py", line 6, in <module>
    from typing import NoReturn
ImportError: cannot import name 'NoReturn'

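There are three explanations for this error.

The error occurs because the typing module in python 3.6.1 has no NoReturn, so it cannot be imported. (1) Either upgrade python 3.6.1 to 3.6.2, since typing gained NoReturn in 3.6.2, or (2) stay on 3.6.1 and import it as "typing_extensions.NoReturn" instead. https://github.com/psf/black/issues/1666

Downgrade pip (https://blog.csdn.net/qq_39237205/article/details/125728985). In that report the error appeared only after upgrading pip, but I had not upgraded pip, so this is probably not the cause in my case.

Some say it is a problem with the version of the typing package and suggest downgrading it. (https://github.com/httprunner/httprunner/issues/968)

In practice the simplest fix is to install the older spacy 2.1.0.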
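
For reference, option (2) above usually takes the form of a guarded import. The sketch below shows that pattern in your own code; it assumes the typing_extensions package is installed on interpreters that need the fallback, and the function fail is a made-up example. It does not repair the traceback above, where the failing import sits inside the third-party immutables package.

```python
try:
    # Available in the standard library from Python 3.6.2 onward.
    from typing import NoReturn
except ImportError:
    # Fallback for older interpreters; requires `pip install typing_extensions`.
    from typing_extensions import NoReturn


def fail(message: str) -> NoReturn:
    """A function annotated as never returning normally."""
    raise RuntimeError(message)
```

Because the broken import lives in an installed dependency, pinning spacy to an older release remains the more practical route here.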

Installing the other packages

pip install matplotlib==3.0.2 # released 2018-11-11
pip install scikit-learn==0.20.2 # released 2018-12-19
pip install torchtext==0.3.1 # released 2018-11-11
pip install TensorboardX
pip install tqdm
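
The rule used throughout this post, picking releases that predate the code, can be sketched as a date filter. The dates below are the ones stated in this post; the function name released_before is made up for illustration.

```python
from datetime import date


def released_before(releases, cutoff):
    """Keep the (package, version) pairs whose release date is earlier than cutoff."""
    return sorted(key for key, released in releases.items() if released < cutoff)


if __name__ == "__main__":
    code_release = date(2019, 4, 30)  # release date of the repo, as stated in this post
    releases = {                      # dates as stated in this post
        ("matplotlib", "3.0.2"): date(2018, 11, 11),
        ("scikit-learn", "0.20.2"): date(2018, 12, 19),
        ("spacy", "2.1.0"): date(2019, 3, 18),
    }
    for package, version in released_before(releases, code_release):
        print(package, version)
```

For a package with several candidate versions, the same filter narrows the list to the releases worth trying first.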

Then run the code,

python run_SST.py --do_train --epoch_num=1

and you will get this error

                           the model name is TransformerText
device is cpu, not recommend
Traceback (most recent call last):
  File "run_SST.py", line 118, in <module>
    embedding_folder, model_dir, log_dir))
  File "run_SST.py", line 41, in main
    text_field = data.Field(tokenize='spacy', lower=True, include_lengths=True, fix_length=config.sequence_length)
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torchtext/data/field.py", line 152, in __init__
    self.tokenize = get_tokenizer(tokenize)
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torchtext/data/utils.py", line 12, in get_tokenizer
    spacy_en = spacy.load('en')
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/spacy/__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/spacy/util.py", line 136, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Running the following command in the shell fixes it.

ref: https://blog.csdn.net/ming_ruo_xiao_xi/article/details/86558267

python -m spacy download en

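Putting the dataset and embedding files in the expected locations

Dataset

The author's README.md for the repo also gives the download link for the dataset this project uses:

sst-2: link: https://pan.baidu.com/s/1ax9uCjdpOHDxhUhpdB0d_g     extraction code: rxbi

After downloading, put the contents of the SST-2 folder under ./dataset/SST-2, next to run_SST.py.

Then change the data folder paths in the code.

Change run_SST.py line 95 to the following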
# data_dir = "/search/hadoop02/suanfa/songyingxin/data/SST-2"
data_dir = "./dataset/SST-2"
cache_dir = ".cache/"
# embedding_folder = "/search/hadoop02/suanfa/songyingxin/data/embedding/glove/"
embedding_folder = "./dataset/embedding/glove/"

If you do not download the dataset and place it as described above, you get the error below, saying the data file cannot be found.

                           the model name is TransformerText
device is cpu, not recommend
Traceback (most recent call last):
  File "run_SST.py", line 118, in <module>
    embedding_folder, model_dir, log_dir))
  File "run_SST.py", line 44, in main
    train_iterator, dev_iterator, test_iterator = load_sst2(config.data_path, text_field, label_field, config.batch_size, device, config.glove_word_file, config.cache_path)
  File "/media/F:/FILES_OF_ALBERT/IT_paid_class/IT_training/greedyAI_intern/Code/TextClassification/Utils/SST2_utils.py", line 12, in load_sst2
    fields=[('text', text_field), ('label', label_field)])
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torchtext/data/dataset.py", line 78, in splits
    os.path.join(path, train), **kwargs)
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torchtext/data/dataset.py", line 251, in __init__
    with io.open(os.path.expanduser(path), encoding="utf8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/search/hadoop02/suanfa/songyingxin/data/SST-2/train.tsv'
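
A check like the one below catches this before a run starts. It is a sketch under the assumption that the folder should hold train.tsv, dev.tsv, and test.tsv, the three file names mentioned in this post; the helper name missing_files is made up.

```python
import os


def missing_files(data_dir, names=("train.tsv", "dev.tsv", "test.tsv")):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in names if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    missing = missing_files("./dataset/SST-2")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all dataset files found")
```

Run it from the folder that contains run_SST.py, since the path is relative.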

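Embedding file

In text classification, raw text is something a computer cannot work with directly. What it can work with is numbers and vectors of numbers, so we convert the text into numeric vectors that the computer can process.

Whatever method you use to vectorize text, what you end up with is a dictionary that maps each character or word to a numeric vector. It tells you how a given token is represented as a vector.

So here we do not need to care much about how the text is vectorized. We only need a dictionary file that someone else has already trained, mapping tokens to vectors, and can use it directly.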
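
The idea of a token-to-vector dictionary can be shown with a toy table. Everything in this sketch is made up for illustration: the words, the 3-dimensional vectors, and the all-zero vector used for unknown tokens. It is not how the project loads its embeddings.

```python
def embed(tokens, table, unk):
    """Look up each token's vector, falling back to unk for unseen tokens."""
    return [table.get(token, unk) for token in tokens]


if __name__ == "__main__":
    # Toy 3-dimensional vectors with made-up values.
    table = {
        "good": [0.9, 0.1, 0.3],
        "bad": [-0.8, 0.2, 0.1],
        "movie": [0.0, 0.7, 0.5],
    }
    unk = [0.0, 0.0, 0.0]
    print(embed(["good", "movie", "zzz"], table, unk))
```

A sentence thus becomes a list of vectors, one per token, which is the form a model can consume.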

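The author uses a word embedding file trained with GloVe: "glove.840B.300d.txt".

You can download it from the Stanford site: https://nlp.stanford.edu/projects/glove/

or from Kaggle: https://www.kaggle.com/datasets/takuok/glove840b300dtxt?resource=download

Once you have the word embedding dictionary "glove.840B.300d.txt", put it in "./dataset/embedding/glove/" (the path defined just above).

If you do not place the word embedding dictionary there, you get the error below, saying "glove.840B.300d.txt" cannot be found.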

        the model name is TransformerText
device is cpu, not recommend
the size of train: 65328, dev:872, test:2021
Traceback (most recent call last):
  File "run_SST.py", line 120, in <module>
    embedding_folder, model_dir, log_dir))
  File "run_SST.py", line 44, in main
    train_iterator, dev_iterator, test_iterator = load_sst2(config.data_path, text_field, label_field, config.batch_size, device, config.glove_word_file, config.cache_path)
  File "/media/F:/FILES_OF_ALBERT/IT_paid_class/IT_training/greedyAI_intern/Code/TextClassification/Utils/SST2_utils.py", line 15, in load_sst2
    vectors = vocab.Vectors(embedding_file, cache_dir)
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torchtext/vocab.py", line 280, in __init__
    self.cache(name, cache, url=url, max_vectors=max_vectors)
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torchtext/vocab.py", line 327, in cache
    raise RuntimeError('no vectors found at {}'.format(path))
RuntimeError: no vectors found at .cache/./dataset/embedding/glove/glove.840B.300d.txt

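A brief explanation of what this word embedding dictionary contains.

The txt file stores a vector for each English word.

For example, each of the tokens (, . the and to of a in " : is) maps to a 300-dimensional vector.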
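
Each line of a GloVe text file is a token followed by its space-separated vector values, so one line can be parsed as sketched below. The function name parse_glove_line is made up, and the sample line is shortened to 3 made-up values instead of 300.

```python
def parse_glove_line(line, dim=300):
    """Split one line of a GloVe text file into (token, vector)."""
    # Split from the right so a token that itself contains spaces stays whole.
    parts = line.rstrip("\n").rsplit(" ", dim)
    token, values = parts[0], parts[1:]
    if len(values) != dim:
        raise ValueError("expected {} values, got {}".format(dim, len(values)))
    return token, [float(value) for value in values]


if __name__ == "__main__":
    # A shortened, made-up line with 3 dimensions instead of 300.
    print(parse_glove_line("the 0.27 -0.06 -0.19", dim=3))
```

Splitting from the right is a defensive choice: it keeps the parse correct even if a token contains a space.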

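Running the code: training, testing, and evaluation

Then run the code

# This trains first, then tests, and prints the test results (performance score)
python run_SST.py --do_train --epoch_num=10   # train and test

It now runs normally without any errors.

Training on a smaller dataset to shorten training time

The training set is large, so one training run takes quite a while. I wanted to train each of the five word-level models below to confirm that all of their code runs.

word-level: run_SST.py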

TextCNN

TextRNN

LSTMATT

TextRCNN

TransformerText

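You can point the code at a training set that is a tsv file with very few rows, which cuts the training time.

Change line 105 of run_SST.py as follows

# Used for normal training
# data_dir = "./dataset/SST-2"
# When experimenting, to quickly confirm that the training code runs, use a smaller training and test set.

# Used when checking that the code runs
data_dir = "./dataset/SST-2/mini"

Then put three files in this mini folder: "train.tsv", "test.tsv", and "dev.tsv".

Note that "train.tsv" must have no fewer than 14732 rows, because the training loop saves the model parameters once every 15,000 examples. With fewer than 15,000 rows, training finishes without ever saving a model file from this run. Without a model file from this run to load, the test step fails, as in the error below.

If your train.tsv has four thousand rows, the error looks like this, reporting a size mismatch. The message shows that the checkpoint being loaded holds an embedding with 14732 rows, while the vocabulary built from the small training set has only 4749, so the file on disk was saved by an earlier run on more data.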

Traceback (most recent call last):
  File "run_SST.py", line 134, in <module>
    embedding_folder, model_dir, log_dir))
  File "run_SST.py", line 89, in main
    model.load_state_dict(torch.load(model_file))
  File "/home/albert/anaconda3/envs/py361tc100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TextRNN:
        size mismatch for embedding.weight: copying a param with shape torch.Size([14732, 300]) from checkpoint, the shape in current model is torch.Size([4749, 300]).
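
The comparison behind this message can be reproduced with plain tuples standing in for tensor shapes. This is a sketch with a made-up helper name; with real PyTorch objects the shapes would come from the values of each state dict.

```python
def shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ."""
    return {
        name: (shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    }


if __name__ == "__main__":
    # Shapes taken from the traceback above.
    checkpoint = {"embedding.weight": (14732, 300)}
    current = {"embedding.weight": (4749, 300)}
    print(shape_mismatches(checkpoint, current))
```

The first number in each shape is the vocabulary size, which is why a checkpoint trained on one dataset cannot be loaded into a model built from a much smaller one.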

If you follow the advice above and give "train.tsv" more than 15,000 rows, the run completes successfully.
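
One way to produce such a reduced file is to copy the header and the first rows of the full one. The sketch below assumes the first line of the tsv is a header and that the target folder already exists; the function name make_mini_tsv and the paths in the usage line are illustrative.

```python
def make_mini_tsv(src_path, dst_path, n_rows):
    """Copy the header line plus the first n_rows data lines from src_path to dst_path."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(src.readline())  # header line
        for line in src:
            if written >= n_rows:
                break
            dst.write(line)
            written += 1
    return written  # number of data rows actually copied
```

For example, make_mini_tsv("./dataset/SST-2/train.tsv", "./dataset/SST-2/mini/train.tsv", 15000) would keep the header and the first 15,000 rows.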

Do the five word-level models run?

After running them one by one, all five word-level models below run without errors.

TextCNN

TextRNN

LSTMATT

TextRCNN

TransformerText

Does running only the test code raise an error?

python run_SST.py

No, it runs without errors.
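
The two command forms in this post differ only in the --do_train flag. A toy parser shows how such flags typically behave; this is a sketch, not the repo's actual argument handling, and the types and defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the two flags used in the commands in this post."""
    parser = argparse.ArgumentParser(description="toy stand-in for run_SST.py options")
    parser.add_argument("--do_train", action="store_true",
                        help="train before testing; without it, only test")
    parser.add_argument("--epoch_num", type=int, default=1,
                        help="number of training epochs")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--do_train", "--epoch_num=10"])
    print(args.do_train, args.epoch_num)
```

With a store_true flag, leaving --do_train off simply yields False, which matches the test-only run shown above.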

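Running the highway models

The highway models use a char-level + word-level text representation.

There are four highway models:

char-level + word-level: run_Highway_SST.py

TextRCNNHighway

TextCNNHighway

TextRNNHighway

LSTMATTHighway

Dataset paths

The following places in run_Highway_SST.py need to change.

Line 103, which sets where the data is stored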

# Original
# data_dir = "/home/songyingxin/datasets/SST-2"
# What you should normally use
# data_dir = "./dataset/SST-2"
# What I used for quick testing
data_dir = "./dataset/SST-2/mini"

Line 144, which sets the location of the embedding dictionary

# Original
# embedding_folder = "/home/songyingxin/datasets/WordEmbedding/glove/"
# What you should use
embedding_folder = "./dataset/embedding/glove/"

The following also needs to change in ./Utils/SST2_utils.py.

Line 65

# data_dir = "/home/songyingxin/datasets/SST-2"
# data_dir = "./dataset/SST-2"
data_dir = "./dataset/SST-2/mini"

Line 75, change the path

# word_emb_file = "/home/songyingxin/datasets/WordEmbedding/glove/glove.840B.300d.txt"
word_emb_file = "./dataset/embedding/glove/glove.840B.300d.txt"

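Download the embedding file and put it in place

As mentioned above, the four highway models encode text at both the char level and the word level.

So you also need a char-level dictionary that maps letters and punctuation marks to numeric vectors.

Download the text file "glove.840B.300d-char.txt" from (https://www.kaggle.com/datasets/chenwgen/glove840b300dchar) and put it in "./dataset/embedding/glove/".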
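
This post later describes the character file as having 94 rows. One character set of exactly that size is printable ASCII without whitespace: letters, digits, and punctuation. The sketch below enumerates it; that the file covers exactly this set is an assumption, not something checked against the file, and the function name is made up.

```python
import string


def printable_ascii_chars():
    """Letters, digits and punctuation: the printable ASCII set without whitespace."""
    return string.ascii_letters + string.digits + string.punctuation


if __name__ == "__main__":
    chars = printable_ascii_chars()
    print(len(chars))
    print(chars)
```

The count breaks down as 52 letters, 10 digits, and 32 punctuation marks.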

Then change the path in the code

# char_emb_file = "/home/songyingxin/datasets/WordEmbedding/glove/glove.840B.300d-char.txt"
char_emb_file = "./dataset/embedding/glove/glove.840B.300d-char.txt"

Change run_Highway_SST.py line 118 to this

# Original
# embedding_folder = "/home/songyingxin/datasets/WordEmbedding/glove/"
# What you should use
embedding_folder = "./dataset/embedding/glove/"

A brief note on what glove.840B.300d-char.txt contains

"glove.840B.300d-char.txt" differs from "glove.840B.300d.txt". What it vectorizes is not words such as is, of, apple, or tree, but the characters that make up words: the 26 uppercase and 26 lowercase letters plus other symbols such as punctuation marks, 94 characters in total. So "glove.840B.300d-char.txt" has 94 rows, each giving the vector for one character, with 300 dimensions per character.

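Processing the dataset as required and putting it in place

The highway script "run_Highway_SST.py" uses three "jsonl" files rather than the three tsv files used by "run_SST.py" ("train.tsv", "test.tsv", "dev.tsv").

The data in a "jsonl" file is organized as follows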

# trans.py line28
dump.append(dict([
    ('idx', idx),
    ('text', text),
    ('label', label)
]))
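
Records with these three keys can be written one JSON object per line, which is what the jsonl name conventionally means. The sketch below shows that layout with made-up sample records and helper names; whether trans.py writes its output in exactly this way is not confirmed here.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines text, one object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


if __name__ == "__main__":
    records = [
        {"idx": "0", "text": "a made-up positive review", "label": 1},
        {"idx": "1", "text": "a made-up negative review", "label": 0},
    ]
    print(to_jsonl(records), end="")
```

Each line can be parsed on its own, so a large file can be read one record at a time.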

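The author provides code for converting to this format: the trans.py file inside the SST-2 dataset that the author put together and that you downloaded (sst-2: link: https://pan.baidu.com/s/1ax9uCjdpOHDxhUhpdB0d_g extraction code: rxbi).

trans.py itself needs the following adjustments.

Change line 57 to this, then run it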

trans("train.tsv", "train.json")
trans("dev.tsv", "dev.json")
trans("test.tsv", "test.json")

# tokenizer = BertTokenizer.from_pretrained(
#     "/home/songyingxin/datasets/pytorch-bert/vocabs/bert-base-uncased-vocab.txt", do_lower_case=True)
#
# analysis("train.tsv", "train", tokenizer)
# analysis("dev.tsv", "dev", tokenizer)
# analysis("test.tsv", "test", tokenizer)

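Change line 15 as follows

On the first row, row[1] is the header string "label". A string like that cannot be converted to an integer, so the conversion raises an error. What you need to do is skip the conversion step for the first row.

with open(input_file, 'r', encoding='utf-8') as fh:
    rowes = csv.reader(fh, delimiter='\t')

    iter_count = 1
    for row in rowes:
        # Skip the first iteration so the header is not stored as data
        if iter_count == 1:
            # Increment before skipping, so iter_count is 2 on the next iteration and this branch is not taken again
            iter_count += 1
            continue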


        idx = str(total)
        label = int(row[1])
        text = row[0]

        dump.append(dict([
            ('idx', idx),
            ('text', text),
            ('label', label)
        ]))
        total += 1

If you do not change line 15 of "trans.py" this way, you get the error below

Traceback (most recent call last):
  File "trans.py", line 57, in <module>
    trans("train.tsv", "train.json")
  File "trans.py", line 15, in trans
    label = int(row[1])
ValueError: invalid literal for int() with base 10: 'label'
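
The header can also be skipped by consuming one row from the reader before the loop. The following standalone sketch shows that variant; it is not the author's trans.py, the function name read_rows is made up, and the header name of the first column in the sample is an assumption.

```python
import csv
import io


def read_rows(fh):
    """Read (text, label) rows from a tab-separated file object, skipping the header."""
    reader = csv.reader(fh, delimiter="\t")
    next(reader, None)  # consume the header row; None avoids StopIteration on empty input
    records = []
    for idx, row in enumerate(reader):
        records.append({"idx": str(idx), "text": row[0], "label": int(row[1])})
    return records


if __name__ == "__main__":
    sample = io.StringIO("sentence\tlabel\na made-up review\t1\n")
    print(read_rows(sample))
```

This avoids carrying a counter through the loop, at the cost of assuming the file always starts with a header.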

Then run "trans.py" and you will get the three jsonl files

python trans.py

Running the code

python run_Highway_SST.py --do_train --epoch_num=1

In my tests, all four highway models run without errors.

TextRCNNHighway

TextCNNHighway

TextRNNHighway

LSTMATTHighway

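The difference between the char-level and word-level tokenizers

The tokenizer here does two main things: (1) splitting text into tokens and (2) representing those tokens as vectors

A note on the difference between word and character

English

A word is a whole word, such as apple

A character is one of the letters that make up the word, such as the five letters a, p, p, l, e

Chinese

A word is a multi-character term, such as 勇敢 (brave), 学习 (study), 烧饼 (baked flatbread)

A character is a single character within such a term, such as 勇 and 敢; 学 and 习; 烧 and 饼

So the difference between the two lies in how the text is represented as vectors. One encodes apple as a single vector; the other represents apple as a list of five elements, each being the vector of the corresponding letter.

The benefit is that inflected forms of a word (plural, tense), such as apple and apples, are naturally more similar to each other.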
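
The two views of the same text, and the reason inflected forms end up close at the character level, can be illustrated with a small sketch. The overlap score here uses difflib from the standard library as a rough stand-in; it is not the similarity between learned vectors, and the function names are made up.

```python
from difflib import SequenceMatcher


def word_tokens(text):
    """Word-level view: split on whitespace."""
    return text.split()


def char_tokens(word):
    """Char-level view: one element per character."""
    return list(word)


def char_overlap(a, b):
    """Similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, char_tokens(a), char_tokens(b)).ratio()


if __name__ == "__main__":
    print(word_tokens("an apple tree"))
    print(char_tokens("apple"))
    print(char_overlap("apple", "apples"), char_overlap("apple", "tree"))
```

At the word level, apple and apples are two unrelated dictionary entries; at the character level they share five of their characters in the same order.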

https://zhuanlan.zhihu.com/p/360290118

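The songyingxin/TextClassification project implements text classification with nine models. Starting from environment setup, this post walks step by step through what it takes to get the code running, including: (1) which version of each package to install, since the repo author only gives the Python minor version and the pytorch version, while other packages also cause errors if they are too new; (2) where to download the training dataset and the word embedding files, and where to put them; (3) how to generate the dataset files in the format required for char-level training.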