gluonnlp
sinat_24395003
First learn to use the wheel, then learn how wheels are made, then build your own wheel.
Training a BERT+LSTM model on long texts cut into short sentences

# The learning rate matters a lot: with lr=2e-5 the training-set accuracy reaches 0.99; with lr=1e-3 it stays around 0.6 and the loss will not come down.
# The LSTM sequences are variable length; choose a reasonable batch size at test time so memory does not blow up.
# Because BERT embeddings are used, try to cut the text so sequences are around 500 tokens, to keep memory use reasonable??? The code below does not do this.

import gluonnlp as nlp
import mxnet as mx
from mxnet.gluon.block import HybridBlock
from mxnet.
Original · 2020-11-24 14:47:55
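The "cut text to around 500 tokens" note above can be sketched in plain Python. This helper is my own illustration, not code from the post (which explicitly does not do this); the 500-character limit and the punctuation set are assumptions:

```python
import re

def split_long_text(text, max_len=500):
    """Split a long text into chunks of at most max_len characters,
    preferring to cut right after sentence-ending punctuation."""
    # Keep the punctuation with the sentence it ends (lookbehind split).
    sentences = [s for s in re.split(r'(?<=[。!?.!?])', text) if s]
    chunks, current = [], ''
    for sent in sentences:
        # A single sentence longer than max_len must be hard-split.
        while len(sent) > max_len:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sent[:max_len])
            sent = sent[max_len:]
        if len(current) + len(sent) > max_len:
            chunks.append(current)
            current = sent
        else:
            current += sent
    if current:
        chunks.append(current)
    return chunks
```

No characters are dropped, so the chunks always concatenate back to the original text.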
Chinese named entity recognition: mxnet_bertner_cn

Code based on https://nlp.gluon.ai/model_zoo/ner/index.html
1. Changed single-GPU training to multi-GPU training.
2. Modified the data-reading code and added data processing for the prediction path.
3. For prediction, sentences longer than seq_len can be split at punctuation marks into multiple samples.
4. ner_predict.py can be used on its own for prediction (it does need the saved model parameters).
5. A CRF module may be added later.
finetune_bertcn.py code:
import argparse
import logging
impor
Original · 2020-11-16 12:55:25
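Point 3 above (splitting over-long sentences at punctuation into multiple samples) can be sketched as a small pure-Python helper. This is my own illustration of the idea, not the post's code; the punctuation set is an assumption:

```python
def split_for_ner(tokens, seq_len):
    """Split a token list longer than seq_len into several samples,
    preferring to cut just after a punctuation mark."""
    puncts = set(',。,.;;!!??')
    samples = []
    while len(tokens) > seq_len:
        cut = seq_len
        # Walk backwards inside the window to find the last punctuation mark.
        for i in range(seq_len - 1, 0, -1):
            if tokens[i] in puncts:
                cut = i + 1
                break
        samples.append(tokens[:cut])
        tokens = tokens[cut:]
    if tokens:
        samples.append(tokens)
    return samples
```

Each sample then fits within seq_len, and predicted tags can be concatenated back in order afterwards.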
Things to watch out for when using self.params

from mxnet.gluon import nn
from mxnet import nd

class MyDense(nn.HybridBlock):
    def __init__(self, units, in_units, **kwargs):
        super().__init__(**kwargs)
        self.embedding = nn.Embedding(3, 5)
        self.weight = self.params.get('weight
Original · 2020-11-12 15:00:35
A small problem hit when using gluon.utils.split_and_load for multi-GPU training: MXNetError: Check failed: (*begin < *end): Invalid begin, en

import numpy as np
from mxnet import gluon, npx, nd

data = np.arange(15).reshape(3, 5)
dataloader = gluon.data.DataLoader(data, batch_size=2, shuffle=False, last_batch='keep')
devices = [npx.gpu(0), npx.gpu(1)]
for da.
Original · 2020-11-11 10:16:23
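In the snippet above, 3 samples with batch_size=2 and last_batch='keep' yield a final batch of size 1, which cannot be split across 2 GPUs. A pure-Python sketch of the splitting logic (my own mimic, not the library's implementation) shows where the failure comes from:

```python
def split_data(batch, num_slices, even_split=True):
    """Mimic the splitting logic behind gluon.utils.split_and_load:
    divide a batch along axis 0 into one slice per device."""
    size = len(batch)
    if even_split and size % num_slices != 0:
        raise ValueError(
            'data of size %d cannot be evenly split into %d slices' % (size, num_slices))
    step = size // num_slices
    if step == 0:
        # Fewer samples than devices: this is the situation that triggers
        # the "Invalid begin, end" error with last_batch='keep'.
        raise ValueError('batch of size %d is smaller than %d devices' % (size, num_slices))
    slices = [batch[i * step:(i + 1) * step] for i in range(num_slices - 1)]
    slices.append(batch[(num_slices - 1) * step:])
    return slices
```

Fixes are to drop the ragged last batch (last_batch='discard'), or to shrink the device list for the final batch.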
glue.truncate_seqs_equal, glue.concat_sequences
from gluonnlp.data.bert import glue

seqs = [[1, 2, 3], [4, 5, 6]]
print(glue.truncate_seqs_equal(seqs, 4))
seqs = [[1, 2, 3], [4, 5, 6]]
print(glue.truncate_seqs_equal(seqs, 5))
seqs = [['is', 'this', 'jacksonville', '?'], ['no', 'it', 'is', 'not', '.'].
Original · 2020-11-09 14:30:43
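The idea behind truncate_seqs_equal can be sketched in plain Python: repeatedly trim a currently-longest sequence until the total fits the budget. This is my own mimic, not gluonnlp's implementation (in particular the tie-breaking on equal lengths is my own choice):

```python
def truncate_seqs_equal(seqs, max_len):
    """Trim a list of sequences so their total length is at most max_len,
    always shortening a currently-longest sequence first."""
    seqs = [list(s) for s in seqs]
    total = sum(len(s) for s in seqs)
    while total > max_len:
        # On ties, trim the later sequence so earlier ones keep one extra token.
        longest = max(range(len(seqs)), key=lambda i: (len(seqs[i]), i))
        seqs[longest].pop()
        total -= 1
    return seqs
```

This is the usual recipe for fitting a BERT sentence pair into a fixed max_seq_length while keeping the two sides as balanced as possible.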
Self-attentive sentence embedding: interpretability of a classification model

# Personal understanding.
# Code based on https://nlp.gluon.ai/examples/sentiment_analysis/self_attentive_sentence_embedding.html
# The model can be used to extract the keywords that drive a classification; it can be seen as a semi-supervised method.
# The training data was extremely imbalanced, with little human labeling and many mislabeled samples; accuracy was 85% (other algorithms, with resampling for the imbalance, can reach 99%). There is no plan to improve it for now, and as a result the extracted keywords are not very good.
# This algorithm feels better suited to explaining short-text classification.
import os
import jso.
Original · 2020-11-03 17:08:36
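The keyword-extraction reading comes from the attention weights of the structured self-attention model (Lin et al., 2017): A = softmax(w2 · tanh(W1 · Hᵀ)) assigns one weight per token. A minimal NumPy sketch with toy shapes (all names and dimensions here are my own, not the tutorial's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(H, W1, w2):
    """One attention weight per token: softmax(w2 . tanh(W1 . H^T)).
    Large weights mark the tokens the classifier attends to."""
    # H: (seq_len, hidden), W1: (d_a, hidden), w2: (d_a,)
    return softmax(w2 @ np.tanh(W1 @ H.T))

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 16))    # 7 token states, hidden size 16
W1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(8,))
a = attention_weights(H, W1, w2)  # 7 weights summing to 1
```

Reading off the highest-weight tokens per example is what turns the classifier into a keyword extractor.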
An example of the gluonnlp.model.attention_cell interface

from gluonnlp.model import attention_cell
from mxnet import nd
import mxnet as mx
import numpy as np

mx.random.seed(10000)
att_score = nd.arange(12).reshape(2, 2, 3)
print('att_score: ', att_score)
mask = nd.random.randint(0, 2, shape=(2, 2, 3)).astype('float32.
Original · 2020-10-15 17:41:25
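The core of an attention cell's mask handling is a masked softmax: positions where mask == 0 get zero weight and the rest renormalize. A NumPy sketch of that behavior (my own illustration, not gluonnlp's code; it assumes every row has at least one unmasked position):

```python
import numpy as np

def masked_softmax(att_score, mask):
    """Softmax over the last axis in which masked-out (mask == 0)
    positions receive exactly zero weight."""
    neg = np.where(mask.astype(bool), att_score, -1e18)
    e = np.exp(neg - neg.max(axis=-1, keepdims=True))
    e = e * mask          # force exact zeros at masked positions
    return e / e.sum(axis=-1, keepdims=True)
```

This matches the shape convention in the snippet above: scores and mask both (batch, query, key).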
2016 Google machine translation, English to Chinese

Code based on https://gluon-nlp.mxnet.io/examples/machine_translation/gnmt.html
Personal understanding of the encoder-decoder framework: the encoder and decoder use bidirectional RNNs with several unidirectional RNNs stacked on top, and the hidden states of the corresponding layers are handed to the decoder for initialization. Notably, from the encoder's bidirectional RNN, only the backward RNN's hidden state is used to initialize the decoder. The encoder's and decoder's
Original · 2020-10-12 13:04:05
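The detail about taking only the backward direction for decoder initialization can be shown with toy NumPy arrays (shapes and names are my own illustration, not the tutorial's code):

```python
import numpy as np

# Toy shapes: one bidirectional encoder layer produces, at each position,
# a forward state and a backward state of size `hidden` each.
seq_len, hidden = 5, 4
rng = np.random.default_rng(1)
fwd_states = rng.normal(size=(seq_len, hidden))   # forward RNN outputs
bwd_states = rng.normal(size=(seq_len, hidden))   # backward RNN outputs

# The concatenated bi-RNN output fed to the next stacked layer:
bi_output = np.concatenate([fwd_states, bwd_states], axis=-1)

# Decoder initial state: the backward RNN's final hidden state, i.e. its
# state at position 0, since it reads the sequence right to left and so
# its position-0 state summarizes the whole sentence.
decoder_init = bwd_states[0]
```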
FixedBucketSampler (2)

import numpy as np
import gluonnlp
from gluonnlp.data.sampler import ConstWidthBucket, _match_bucket_keys, _bucket_stats
import warnings

class Sampler(object):
    """Base class for samplers.

    All samplers should subclass `Sampler` and define `__iter
Original · 2020-09-15 12:46:28
FixedBucketSampler (1): _match_bucket_keys

import numpy as np

def _match_bucket_keys(bucket_keys, seq_lengths):
    """
    :param bucket_keys: the right-edge value of each interval.
    :param seq_lengths: the list of sequence lengths, one per sentence.
    :return: the sample ids held by each interval, so they are easy to retrieve.
    """
    bucket_key_npy = np.array(bucket_keys, dtype=np.int32).
Original · 2020-09-15 12:44:20
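The mapping from lengths to buckets is essentially a binary search over the sorted right edges. A self-contained sketch of the same idea (my own re-implementation, not gluonnlp's exact code; it assumes every length fits under the last key):

```python
import numpy as np

def match_bucket_keys(bucket_keys, seq_lengths):
    """For each bucket (defined by its right edge), collect the indices of
    the sequences whose length falls into that bucket."""
    keys = np.array(bucket_keys, dtype=np.int32)
    lengths = np.array(seq_lengths, dtype=np.int32)
    # Index of the first bucket whose right edge is >= the length.
    bucket_ids = np.searchsorted(keys, lengths, side='left')
    samples = [[] for _ in bucket_keys]
    for sample_id, b in enumerate(bucket_ids):
        samples[b].append(sample_id)
    return samples
```

The per-bucket id lists are exactly what the sampler later batches together, so sequences of similar length share a batch and padding is minimized.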
ExpWidthBucket (exponential) / ConstWidthBucket: partitioning data into intervals (binning)

# From the max and min lengths and the number of buckets, use an exponentially
# growing interval function to compute the bucket_keys (the right-edge value of each interval).
import mxnet as mx
import math

INT_TYPES = mx.base.integer_types

class BucketScheme:
    r"""Base class for generating bucket keys."""
    def __call__(self, max_lengths, min_lengths, num_buckets):
Original · 2020-09-14 18:56:09
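The two schemes can be sketched in plain Python: constant width slices the range evenly, while exponential width lets later buckets grow geometrically (long sequences are rarer, so they can share wider buckets). These helpers show the idea only; the formulas and the 1.5 ratio are my own assumptions, not gluonnlp's exact implementation:

```python
import math

def const_width_keys(max_len, min_len, num_buckets):
    """Right edges of num_buckets equal-width intervals over [min_len, max_len]."""
    width = max(1, math.ceil((max_len - min_len) / num_buckets))
    return [min_len + width * (i + 1) for i in range(num_buckets)]

def exp_width_keys(max_len, min_len, num_buckets, ratio=1.5):
    """Right edges of intervals whose widths grow geometrically by `ratio`."""
    # Choose the first width so the geometric series covers the whole span:
    # first * (ratio**num_buckets - 1) / (ratio - 1) >= max_len - min_len.
    span = max_len - min_len
    first = max(1, math.ceil(span * (ratio - 1) / (ratio ** num_buckets - 1)))
    keys, edge, width = [], float(min_len), float(first)
    for _ in range(num_buckets):
        edge += width
        keys.append(math.ceil(edge))
        width *= ratio
    return keys
```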
Understanding nlp.data.batchify.CorpusBPTTBatchify

import gluonnlp as nlp
from gluonnlp.data.dataset import CorpusDataset, SimpleDataset
from mxnet import np, npx
import mxnet as mx
import math

npx.set_np()
# batch_size = 5
bptt = 6

def wordtoword_splitter(s):
    """Split character by character."""
    return list(s)

def _slic.
Original · 2020-09-04 17:56:48
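The core of BPTT batchifying is: lay the token stream out as batch_size parallel streams, then cut them into bptt-step windows where the target is the data shifted by one token. A simplified NumPy sketch (my own; the real CorpusBPTTBatchify also handles padding of the last batch, which is omitted here):

```python
import numpy as np

def bptt_batchify(token_ids, batch_size, bptt):
    """Turn a flat token stream into (data, target) mini-batches for
    truncated-BPTT language-model training."""
    # Keep only enough tokens to fill batch_size equal streams.
    n = (len(token_ids) - 1) // batch_size * batch_size
    data = np.array(token_ids[:n]).reshape(batch_size, -1)
    target = np.array(token_ids[1:n + 1]).reshape(batch_size, -1)
    batches = []
    for i in range(0, data.shape[1], bptt):
        # Each batch covers bptt consecutive steps of every stream.
        batches.append((data[:, i:i + bptt], target[:, i:i + bptt]))
    return batches
```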
A brief introduction to transform, _LazyTransformDataset, and transform_first

import io
import os
from gluonnlp.data.dataset import Dataset, SimpleDataset

class Dataset(object):
    """Abstract dataset class. All datasets should have this interface.

    Subclasses need to override `__getitem__`, which returns the i-th element
Original · 2020-09-04 10:23:39
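A stripped-down sketch of how these three pieces fit together (my own minimal mimic of the Gluon dataset API, not the library source): transform with lazy=True wraps the dataset so the function runs on every access, lazy=False applies it once up front, and transform_first touches only the first field of each sample tuple.

```python
class SimpleDataset:
    """Minimal dataset wrapping a list."""
    def __init__(self, data):
        self._data = data
    def __len__(self):
        return len(self._data)
    def __getitem__(self, idx):
        return self._data[idx]
    def transform(self, fn, lazy=True):
        if lazy:
            return _LazyTransformDataset(self, fn)
        return SimpleDataset([fn(x) for x in self._data])
    def transform_first(self, fn, lazy=True):
        # Apply fn only to the first element of each sample tuple.
        return self.transform(lambda sample: (fn(sample[0]),) + tuple(sample[1:]), lazy)

class _LazyTransformDataset(SimpleDataset):
    """Applies fn on every __getitem__ call instead of up front."""
    def __init__(self, dataset, fn):
        super().__init__(dataset)
        self._fn = fn
    def __getitem__(self, idx):
        return self._fn(self._data[idx])
```

transform_first is the common choice for (text, label) pairs: tokenize the text while leaving the label untouched.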
Basic usage of SimpleDataset and CorpusDataset

import io
import os
from gluonnlp.data.dataset import Dataset

def line_splitter(s):
    """Split a string at newlines.

    Parameters
    ----------
    s : str
        The string to be split

    Returns
    -------
    List[str]
        Li
Original · 2020-09-04 09:59:05
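What CorpusDataset does with a line splitter and a token splitter can be shown without the library: split the text into samples (lines), split each sample into tokens, and optionally flatten everything into one stream. A pure-Python sketch (my own helper names, not the library API):

```python
def line_splitter(s):
    """Split a string into lines."""
    return s.splitlines()

def whitespace_splitter(s):
    """Split a line into tokens on whitespace."""
    return s.split()

def corpus_dataset(text, flatten=False):
    """Mimic CorpusDataset: one token list per non-empty line,
    or a single flat token stream when flatten=True."""
    samples = [whitespace_splitter(line)
               for line in line_splitter(text) if line.strip()]
    if flatten:
        return [tok for line in samples for tok in line]
    return samples
```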
A brief analysis of gluonnlp.vocab

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under
Original · 2020-08-26 17:54:24
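The essence of a Vocab object is small enough to sketch directly: special tokens take the first indices, the remaining tokens are added by descending frequency, and lookups of unknown tokens fall back to the unk index. A minimal pure-Python version (my own sketch; gluonnlp's Vocab has more options, e.g. min_freq, max_size, and reserved-token handling):

```python
from collections import Counter

class Vocab:
    """Minimal vocabulary: specials first, then tokens by descending
    frequency; unknown tokens map to the <unk> index."""
    def __init__(self, counter, specials=('<unk>', '<pad>', '<bos>', '<eos>'), min_freq=1):
        self.idx_to_token = list(specials)
        for tok, freq in counter.most_common():
            if freq >= min_freq and tok not in self.idx_to_token:
                self.idx_to_token.append(tok)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        self.unk = self.token_to_idx['<unk>']
    def __len__(self):
        return len(self.idx_to_token)
    def __call__(self, tokens):
        """Map a list of tokens to indices."""
        return [self.token_to_idx.get(t, self.unk) for t in tokens]
```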
Bert WordpieceTokenizer

Bert WordpieceTokenizer: tokenization into character pieces. My understanding: it splits a token into the smaller pieces present in the vocab. The matching scans a window over the characters, shrinking it from the back while checking against the vocab; when a piece matches it is kept, and the remaining characters are processed next.
def tokenize(text):
    unk_token = '<unk>'
    vocab = ["un", "##aff", "##able"]
    output_tokens = []
    for token in [text]:
        chars = list(token.
Original · 2020-08-21 18:23:10
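The greedy longest-match-first procedure described above can be completed into a runnable sketch (my own reconstruction in the spirit of BERT's WordpieceTokenizer, not the library source; the max_input_chars cutoff is an assumption):

```python
def wordpiece_tokenize(text, vocab, unk_token='<unk>', max_input_chars=200):
    """Greedy longest-match-first WordPiece: at each start position, shrink
    the end of the window until a piece in the vocab is found; non-initial
    pieces carry the '##' continuation prefix."""
    output_tokens = []
    for token in text.split():
        chars = list(token)
        if len(chars) > max_input_chars:
            output_tokens.append(unk_token)
            continue
        start, pieces, is_bad = 0, [], False
        while start < len(chars):
            end, cur = len(chars), None
            while start < end:
                piece = ''.join(chars[start:end])
                if start > 0:
                    piece = '##' + piece   # mark a non-initial piece
                if piece in vocab:
                    cur = piece
                    break
                end -= 1                   # shrink the window from the back
            if cur is None:
                is_bad = True              # no vocab piece covers this position
                break
            pieces.append(cur)
            start = end                    # continue on the remaining characters
        output_tokens.extend([unk_token] if is_bad else pieces)
    return output_tokens
```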
Understanding the mx.nd.sparse.csr_matrix function

"""
For a concrete understanding of csr_matrix, see https://cloud.tencent.com/developer/article/1387734
"""
def csr_matrix(arg1, shape=None, ctx=None, dtype=None):
    """Creates a `CSRNDArray`, a 2D array in the compressed sparse row (CSR) format.

    The CSRNDArray can be instantiated
Original · 2020-06-12 13:53:00
Understanding the pad mechanism of gluonnlp.data.batchify

import math
import mxnet as mx
import numpy as np
import warnings

def _pad_arrs_to_max_length(arrs, pad_axis, pad_val, use_shared_mem, dtype, round_to=None):
    """Inner implementation of the Pad batchify: pad each array in the list [arr, arr]
    so that its pad_axis dimension matches the maximum length along that axis.

    Parameters
Original · 2020-06-09 21:38:06
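The padding mechanism can be reproduced with NumPy alone: find the maximum length along the pad axis (optionally rounded up to a multiple of round_to), pad every array up to it, then stack. This is my own simplified sketch of the idea, not gluonnlp's implementation (shared memory and dtype handling are omitted):

```python
import math
import numpy as np

def pad_to_max_length(arrs, pad_axis=0, pad_val=0, round_to=None):
    """Pad every array along pad_axis up to the batch maximum (optionally
    rounded up to a multiple of round_to) and stack into one batch."""
    max_len = max(arr.shape[pad_axis] for arr in arrs)
    if round_to is not None:
        max_len = round_to * math.ceil(max_len / round_to)
    padded = []
    for arr in arrs:
        pad_width = [(0, 0)] * arr.ndim
        pad_width[pad_axis] = (0, max_len - arr.shape[pad_axis])
        padded.append(np.pad(arr, pad_width, constant_values=pad_val))
    return np.stack(padded)
```

round_to exists to limit the number of distinct batch shapes, which helps when shapes trigger recompilation or memory reallocation.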
Understanding gluonnlp's fasttext_ngram_hashes generation function

from numba import njit
import numpy as np

def _fasttext_ngram_hashes(word, ns, bucket_size):
    """Generate the hash codes of the word's n-grams, for every n in ns."""
    hashes = []
    max_n = np.max(ns)
    for i in range(len(word)):  # pylint: disable=consider-using-enumerate
        if (wor..
Original · 2020-05-16 21:10:12
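The underlying idea: fastText hashes every character n-gram of a word with an FNV-1a-style 32-bit hash and maps it into a fixed number of buckets. A plain-Python sketch of that scheme (my own reconstruction; the original C++ hashes raw bytes with signed-char semantics, so values for non-ASCII input may differ from fastText's):

```python
def fnv1a_hash(subword):
    """32-bit FNV-1a hash over the UTF-8 bytes of a subword."""
    h = 2166136261
    for b in subword.encode('utf-8'):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def ngram_hashes(word, ns, bucket_size):
    """Hash every character n-gram of `word` (for each n in ns)
    into one of bucket_size buckets."""
    hashes = []
    for n in ns:
        for i in range(len(word) - n + 1):
            hashes.append(fnv1a_hash(word[i:i + n]) % bucket_size)
    return hashes
```

The word is normally wrapped in boundary markers (e.g. '<where>') before hashing, so prefix and suffix n-grams are distinguishable.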