nlp-beginner task4: Sequence Labeling with LSTM+CRF

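This post covers the LSTM+CRF sequence labeling task, walks through the PyTorch implementation it is based on, and shares the training results. In the summary, the author recommends several resources on CRFs and raises a question about how CRFs and HMMs differ in practical use.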

https://github.com/FudanNLP/nlp-beginner


1. Code

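The code follows the official PyTorch tutorial "Advanced: Making Dynamic Decisions and the Bi-LSTM CRF". The model part is identical to the tutorial, so there is little to say about it; I only added some comments once I understood it.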
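For readers who want the core idea of the CRF layer before reading the full listing, here is a minimal pure-Python sketch of the two dynamic programs a linear-chain CRF relies on: the forward algorithm (log-partition) and Viterbi decoding. It is an illustration under stated assumptions, not the tutorial's code: the function names, the toy scores, and the `transitions[prev][next]` indexing are all made up for this example, and the START/STOP tags used in the listing below are left out for brevity.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions, transitions, path):
    """Score of one tag path: emission scores plus transition scores."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all tag paths."""
    num_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return log_sum_exp(alpha)


def viterbi(emissions, transitions):
    """Return (best_score, best_path); same recursion with max instead of log-sum-exp."""
    num_tags = len(transitions)
    delta = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_delta, pointers = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags), key=lambda i: delta[i] + transitions[i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] + transitions[best_i][j] + emissions[t][j])
        delta = new_delta
        backpointers.append(pointers)
    last = max(range(num_tags), key=lambda j: delta[j])
    path = [last]
    for pointers in reversed(backpointers):  # follow the back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[last], path


# Toy scores (made up for illustration): 3 time steps, 2 tags.
emissions = [[1.0, 0.2], [0.3, 1.5], [0.9, 0.4]]
transitions = [[0.5, -0.2], [-1.0, 0.8]]  # transitions[prev][next], an assumed convention

best_score, best_path = viterbi(emissions, transitions)
log_z = log_partition(emissions, transitions)
# Negative log-likelihood of a path: log-partition minus that path's score.
nll = log_z - path_score(emissions, transitions, best_path)
```

In a BiLSTM-CRF the emission scores would come from the LSTM output and the transition scores would be learned parameters; training minimizes the negative log-likelihood of the gold tag path, and prediction uses the Viterbi path.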

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import random_split
import pandas as pd
import numpy as np
import random

torch.manual_seed(1)
data = []

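# Read the training file: one token per line, sentences separated by blank lines
with open('./train.txt', 'r', encoding='utf-8') as f:
    f.readline()  # skip the first line
    phrase = []
    token = []
    for line in f:
        if line.strip() == '':
            if len(token) > 0:
                data.append([phrase, token])
                phrase = []
                token = []
        else:
            fields = line.split()
            phrase.append(fields[0])  # the word is the first column
            token.append(fields[-1])  # the label is the last column
    if len(token) > 0:  # keep the last sentence if the file has no trailing blank line
        data.append([phrase, token])
data_len = len(data)  # 14986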

word_to_ix = {}  # assign an index to each word
ix_to_word = {}
label_to_ix = {}
ix_to_label = {}
word_set = set()
label_set = set()
for sent, toke in data:
    for word in sent:
        if word not in word_to_ix:
            ix_to_word[len(word_to_ix)] = word
            word_to_ix[word] = len(word_to_ix)
            word_set.add(word)
    for tokens in toke:
        if tokens not in label_to_ix:
            ix_to_label[len(label_to_ix)] = tokens
            label_to_ix[tokens] = len(label_to_ix)
            label_set.add(tokens)

unk = '<unk>'
ix_to_word[len(word_to_ix)] = unk
word_to_ix[unk] = len(word_to_ix)
word_set.add(unk)

START_TAG = "<START>"
STOP_TAG = "<STOP>"
ix_to_label[len(label_to_ix)] = START_TAG
label_to_ix[START_TAG] = len(label_to_ix)
label_set.add(START_TAG)
ix_to_label[len(label_to_ix)] = STOP_TAG
label_to_ix[STOP_TAG] = len(label_to_ix)
label_set.add(STOP_TAG)

train_len = int(0.8 * data_len)
test_len = data_len - train_len
train_data, test_data = random_split(data, [train_len, test_len])  # split the dataset
# print(type(train_data))  # torch.utils.data.dataset.Subset
train_data = list(train_data)
test_data = list(test_data)

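# Parameter dictionary, handy for hyperparameter tuning
args = {
    'vocab_size': len(word_to_ix),  # vocabulary size; the embedding layer needs it to build the word vectors
    'embedding_size': 50,  # dimensionality of each word vector (number of features)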
    'hidden_size': 16,
    