Mastering Natural Language Processing with Python, 1: String Operations

Code: https://github.com/PacktPublishing/Mastering-Natural-Language-Processing-with-Python
1. Tokenization

    Tokenization is the process of splitting text into smaller units called tokens. The sent_tokenize function uses an instance of the PunktSentenceTokenizer class from the NLTK package. This instance has been trained on the characters and punctuation marks that signal the beginning and end of sentences, so it can tokenize text in several European languages.

import nltk
text=" Welcom readers. I hope you find it interesting. Please do reply."
from nltk.tokenize import sent_tokenize
sent_tokenize(text)
Out[4]: [' Welcome readers.', 'I hope you find it interesting.', 'Please do reply.']

To tokenize a large number of sentences, load PunktSentenceTokenizer directly and call its tokenize() function; tokenizers trained for other languages can be loaded the same way:

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # or e.g. 'tokenizers/punkt/french.pickle' for French
tokenizer.tokenize(text)

To split a sentence into words, use the word_tokenize() function, which uses an instance of the TreebankWordTokenizer class from the NLTK package:

text=nltk.word_tokenize("I hope you find it interesting.")
print(text)
['I', 'hope', 'you', 'find', 'it', 'interesting', '.']

Tokenization also splits contractions apart:

text=nltk.word_tokenize("Don't hesitate to ask questions")
print(text)
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
The same result can be obtained by loading TreebankWordTokenizer and then calling its tokenize() function:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")
Out[14]: 
['Have',
 'a',
 'nice',
 'day.',
  ...

Another tokenizer, PunktWordTokenizer, works by splitting punctuation off from words (note that it is no longer shipped with recent NLTK releases). There is also WordPunctTokenizer, which turns punctuation into entirely separate tokens:

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Don't hesitate to ask questions")
Out[17]: ['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

The tokenizers' inheritance tree: (diagram from the book, not reproduced here)

    Tokenization using regular expressions (p. 20 of the book): either by matching the words themselves or by matching the spaces/gaps between them (the gap-matching variant is sketched after the example below):

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w]+")
tokenizer.tokenize("Don't hesitate to ask questions")
Out[20]: ['Don', 't', 'hesitate', 'to', 'ask', 'questions']
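
    The complementary approach matches the whitespace between words rather than the words themselves; a minimal sketch using RegexpTokenizer's gaps parameter:

from nltk.tokenize import RegexpTokenizer
# gaps=True makes the pattern describe the separators, not the tokens
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokenizer.tokenize("Don't hesitate to ask questions")
# expected: ["Don't", 'hesitate', 'to', 'ask', 'questions']
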
2. Normalization

    Normalization mainly involves operations such as eliminating punctuation, converting to upper or lower case, converting numbers into words, expanding abbreviations, and canonicalizing text.

    Eliminating punctuation (sketched below):
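
    A minimal sketch using Python's built-in string.punctuation; the helper name remove_punctuation is illustrative, not from the book:

import string

def remove_punctuation(tokens):
    # Delete every character listed in string.punctuation,
    # then drop any token that becomes empty
    table = str.maketrans('', '', string.punctuation)
    stripped = [token.translate(table) for token in tokens]
    return [token for token in stripped if token]

remove_punctuation(["Don't", 'hesitate', 'to', 'ask', 'questions', '.'])
# expected: ['Dont', 'hesitate', 'to', 'ask', 'questions']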

    Case conversion of text (illustrated below):
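
    A short illustration with Python's built-in string methods:

text = "HARdWork IS KEy to SUCCESS"
text.lower()
# 'hardwork is key to success'
text.upper()
# 'HARDWORK IS KEY TO SUCCESS'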

    Handling stop words: stop words are words that should be filtered out, because they contribute little to the overall meaning of a sentence. Search engines remove stop words in order to narrow their search space. The stop word lists can be accessed under nltk_data/corpora/stopwords:

from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
words = ["Don't", "hesitate", "to", "ask", "questions"]
[word for word in words if word not in stops]
Out[28]: ["Don't", 'hesitate', 'ask', 'questions']
3. Substituting and correcting tokens

    Substituting words using regular expressions (sketched below):
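
    A minimal sketch of a regex-based replacer in the spirit of the book's replacers module; this pattern list is an abbreviated, illustrative subset:

import re

# Each pair maps a contraction pattern to its expansion
replacement_patterns = [
    (r"won't", 'will not'),
    (r"can't", 'cannot'),
    (r"n't", ' not'),
]

def replace_contractions(text):
    for pattern, repl in replacement_patterns:
        text = re.sub(pattern, repl, text)
    return text

replace_contractions("Don't hesitate to ask questions")
# expected: 'Do not hesitate to ask questions'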

    Substituting words with their synonyms (sketched below):
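
    A minimal dictionary-based synonym replacer; the word map here is illustrative:

class WordReplacer:
    # Replace a word with its mapped synonym, or return it unchanged
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        return self.word_map.get(word, word)

replacer = WordReplacer({'bday': 'birthday', 'congrats': 'congratulations'})
replacer.replace('bday')
# expected: 'birthday'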

4. Applying Zipf's law to text

    Zipf's law states that the frequency of a token in a text is inversely proportional to its rank, i.e. its position in the sorted frequency list. The law describes how tokens are distributed in language: a few tokens occur very frequently, some occur with medium frequency, and many occur only rarely.

import nltk
from nltk.corpus import gutenberg
from nltk.probability import FreqDist
import matplotlib
matplotlib.use('TkAgg')  # select the backend before importing pyplot
import matplotlib.pyplot as plt

# Count the frequency of every token in the Gutenberg corpus
fd = FreqDist()
for text in gutenberg.fileids():
    for word in gutenberg.words(text):
        fd[word] += 1

# Ranks must follow descending frequency, so iterate over most_common()
ranks = []
freqs = []
for rank, (word, freq) in enumerate(fd.most_common(), start=1):
    ranks.append(rank)
    freqs.append(freq)

# Under Zipf's law the log-log plot is approximately a straight line
plt.loglog(ranks, freqs)
plt.xlabel('rank (r)', fontsize=14, fontweight='bold')
plt.ylabel('frequency (f)', fontsize=14, fontweight='bold')
plt.grid(True)
plt.show()


5. Similarity measures

    The nltk.metrics package provides various evaluation and similarity measures.

    Using the edit distance algorithm (example below):
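
    NLTK exposes this as nltk.metrics.edit_distance, the Levenshtein distance between two strings:

from nltk.metrics import edit_distance
edit_distance('relate', 'relation')
# 3: substitute 'e' with 'i', then insert 'o' and 'n'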

    Using the Jaccard coefficient (example below):
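
    nltk.metrics.jaccard_distance computes one minus the Jaccard coefficient of two sets:

from nltk.metrics import jaccard_distance
X = set([10, 20, 30, 40])
Y = set([20, 30, 60])
jaccard_distance(X, Y)
# 0.6: the sets share 2 of 5 distinct elements, so 1 - 2/5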

    Using the Smith-Waterman distance (sketched below):
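
    Smith-Waterman local alignment is not shipped with nltk.metrics; a minimal sketch of the alignment score, with illustrative match/mismatch/gap weights:

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    # H[i][j] holds the best local-alignment score ending at a[i-1] and b[j-1];
    # Smith-Waterman clamps every cell at zero so alignments can restart
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

smith_waterman_score('GATTACA', 'GCATGCU')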

    Other string similarity measures (examples below):
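
    nltk.metrics also provides, among others, binary_distance (0.0 for identical label sets, 1.0 otherwise) and masi_distance, which discounts partial agreement between sets:

from nltk.metrics import binary_distance, masi_distance
binary_distance(set([10, 20]), set([10, 20]))
# 0.0, because the two label sets are identical
masi_distance(set([10, 20]), set([20, 30]))
# between 0 and 1; larger when the sets agree less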

     

Reference: Deepti Chopra, Nisheeth Joshi, Iti Mathur, Mastering Natural Language Processing with Python, Packt Publishing, 2016, ISBN 1783989041, 238 pages.