[NLP] Essential NLP Libraries for Competitions

Latest recommended article published on 2024-08-05 15:57:37

风度78

Article tags: artificial intelligence, programming languages, java, deep learning, big data

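Essential NLP Libraries

This week we have compiled the NLP libraries most commonly used in machine learning and competitions, for easy reference. We recommend bookmarking this article.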

jieba

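jieba is an excellent third-party Python library for Chinese word segmentation; a few lines of code are enough to segment a Chinese sentence. Its accuracy and speed are strong, and it is frequently used as a baseline in Chinese word segmentation experiments. jieba also makes it easy to add a custom dictionary, which makes it very flexible in practice.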

import jieba

# Full mode: emit every word the dictionary can find in the sentence
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))
# [Full mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

# Accurate mode: pick a single segmentation of the sentence
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))
# [Accurate mode]: 我/ 来到/ 北京/ 清华大学

# cut() uses accurate mode by default
seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))
# [New word recognition]: 他, 来到, 了, 网易, 杭研, 大厦
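
The custom-dictionary feature mentioned above matters because a dictionary-driven segmenter can only emit words it knows about. As a rough illustration, the sketch below implements forward maximum matching, a much simpler dictionary-based method than jieba's own approach (which, per its README, combines a prefix dictionary, a DAG with dynamic programming, and an HMM for unknown words). The function name `fmm_segment` and both toy dictionaries are assumptions made for this example, not part of jieba.

```python
def fmm_segment(sentence, dictionary, max_word_len=4):
    """Segment `sentence` by forward maximum matching against `dictionary`.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


base_dict = {"来到", "北京", "清华", "大学"}
custom_dict = base_dict | {"清华大学"}  # toy "user dictionary" entry

print("/".join(fmm_segment("我来到北京清华大学", base_dict)))
print("/".join(fmm_segment("我来到北京清华大学", custom_dict)))
```

With the base dictionary the last four characters split into two words; once the longer entry is added, the same span is kept as a single word.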

jieba project homepage: https://github.com/fxsjy/jieba

jieba also has a C++ implementation; if the Python version is not fast enough for your needs, it is worth trying.

spaCy

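spaCy is a full-featured NLP library that can run alongside deep learning frameworks. It provides the standard functionality for most NLP tasks (tokenization, part-of-speech tagging, parsing, named entity recognition), integrates with existing deep learning frameworks, and offers pretrained language models for common languages.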

import spacy


# Load English tokenizer, tagger, parser, NER and word vectors
# (download the model first: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")


# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)


# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])


# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

spaCy project homepage: https://spacy.io/

Gensim

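Gensim is an efficient Python library for natural language processing, used mainly to extract semantic topics from documents. It takes raw, unstructured digital text (plain text) as input, and its built-in algorithms include Word2Vec, FastText, and LSA.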

from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

# Temporary file path for the saved model
path = get_tmpfile("word2vec.model")

# Gensim 4.0+ uses `vector_size`; older releases called this parameter `size`
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)
model.save(path)
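
To make the `window` parameter above more concrete: in its skip-gram form, Word2Vec learns from (center word, context word) pairs drawn from a sliding window over each sentence. The stdlib-only sketch below enumerates such pairs. It only illustrates the idea and is not Gensim's implementation, whose training loop is considerably more involved; `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["human", "interface", "computer"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

A larger `window` produces more pairs per word, so each word is trained against a broader context.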

Gensim project website: https://radimrehurek.com/gensim/

NLTK

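NLTK is a free, open-source, community-driven project. It provides over 50 corpora and lexical resources (such as WordNet), along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.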

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]

NLTK website: http://www.nltk.org/

TextBlob

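TextBlob is an open-source text processing library written in Python. It can be used for many natural language processing tasks, such as part-of-speech tagging, noun phrase extraction, and sentiment analysis. Older releases also offered text translation, but that API has been removed from recent versions.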

from textblob import TextBlob


text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''


blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]


blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])


for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341
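
By default, polarity scores like the ones above come from a lexicon-based analyzer: individual words carry hand-assigned scores that are combined per sentence. Below is a minimal sketch of that idea, assuming a tiny made-up lexicon (`TOY_LEXICON`) and a plain average, with none of the negation or intensifier handling a real analyzer would need. Nothing here is TextBlob's actual code or data.

```python
import re

# Hypothetical word -> polarity scores in [-1.0, 1.0]; not TextBlob's lexicon.
TOY_LEXICON = {"good": 0.7, "great": 0.8, "bad": -0.7, "terrible": -1.0}


def toy_polarity(sentence, lexicon=TOY_LEXICON):
    """Average the polarity of known words; return 0.0 if none are known."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("The plot was great but the ending was bad."))
print(toy_polarity("Nothing to see here."))
```

A positive and a negative word largely cancel out in the first sentence, which is one reason mixed reviews tend to land near zero under this kind of scoring.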

TextBlob website: https://textblob.readthedocs.io/en/dev/

CoreNLP

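Stanford CoreNLP is a collection of tools for processing natural language. It can give the base forms of words and their parts of speech, recognize whether they are names of companies, people, and so on, normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and syntactic dependencies, and identify relations between entities, sentiment, and quotes people said.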

CoreNLP is written in Java and can be deployed as a server, and it can also be called from Python. It is widely used in both industry and academia.
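
As a rough sketch of the server-style usage described above: a running CoreNLP server accepts HTTP POST requests whose body is the raw text and whose `properties` query parameter is a JSON object naming the annotators and output format. The code below uses only the Python standard library. The address `http://localhost:9000` and the helper names `build_corenlp_request` and `annotate` are assumptions for this example, and a server must already be running for `annotate` to return anything.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_request(text, annotators="tokenize,ssplit,pos",
                          base_url="http://localhost:9000"):
    """Build (but do not send) a POST request for a CoreNLP server."""
    properties = {"annotators": annotators, "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    return urllib.request.Request(
        url=f"{base_url}/?{query}",
        data=text.encode("utf-8"),
        method="POST",
    )


def annotate(text, **kwargs):
    """Send the request and decode the JSON response (needs a live server)."""
    with urllib.request.urlopen(build_corenlp_request(text, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only the request is built here, so this runs without a server.
request = build_corenlp_request("Stanford University is in California.")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect or log the exact URL before pointing the code at a real server.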

CoreNLP website: https://stanfordnlp.github.io/CoreNLP/

AllenNLP

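AllenNLP is a general-purpose deep learning framework for NLP built by the well-known Allen Institute for AI (AI2). It includes state-of-the-art reference models that can be deployed quickly, and it supports a wide range of tasks and datasets.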

AllenNLP website: https://allennlp.org/

TorchText

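TorchText is PyTorch's supporting library for NLP. It contains convenient data processing utilities for batching and preparing data before it is fed into a deep learning framework. TorchText makes it easy to load training, validation, and test datasets, tokenize them, build a vocabulary, and create iterators over them.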
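
The steps listed above (tokenize, build a vocabulary, iterate in batches) can be sketched without any framework. This is a plain-Python illustration of the pipeline, not TorchText's API; `build_vocab`, `batch_iterator`, and the special tokens `<pad>` and `<unk>` are names chosen for this example.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_freq=1, specials=("<pad>", "<unk>")):
    """Map each token seen at least `min_freq` times to an integer id."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


def batch_iterator(tokenized_texts, vocab, batch_size=2):
    """Yield batches of id lists, padded to the longest text in each batch."""
    pad, unk = vocab["<pad>"], vocab["<unk>"]
    for start in range(0, len(tokenized_texts), batch_size):
        chunk = tokenized_texts[start:start + batch_size]
        width = max(len(text) for text in chunk)
        yield [
            [vocab.get(tok, unk) for tok in text] + [pad] * (width - len(text))
            for text in chunk
        ]


# Whitespace splitting stands in for a real tokenizer.
texts = [s.split() for s in ["the cat sat", "the dog", "a cat"]]
vocab = build_vocab(texts)
for batch in batch_iterator(texts, vocab):
    print(batch)
```

Padding every sequence in a batch to the same length is what allows the batch to be turned into a single rectangular tensor later.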

TorchText homepage: https://github.com/pytorch/text

Transformers

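Transformers is one of the most popular NLP libraries today. It implements a wide range of Transformer models, from BERT and GPT-2 to BART and Reformer. Hugging Face's code is readable and its documentation is clear. The official GitHub repository even organizes its Python example scripts by task, such as language modeling, text generation, question answering, and multiple choice.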

Hugging Face website: https://huggingface.co/

OpenNMT

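OpenNMT is a convenient and powerful tool for machine translation and sequence learning tasks. Its highly configurable models and training procedures make it a straightforward framework to work with. Because it is open source and simple to use, we recommend OpenNMT for all kinds of sequence learning tasks.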

OpenNMT website: https://opennmt.net/
