August 2018_剑九黄


Original 01_String Processing-----05_Similarity Measures

# The nltk.metrics package in NLTK provides various evaluation and similarity measures
from __future__ import print_function
from nltk.metrics import *
def main1():
    training = 'PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split()
    testing =...

2018-08-26 12:58:16 308
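
The excerpt above is cut off before the test labels and the metric calls appear, so the following is only a minimal standard-library sketch of the idea, not the post's code. The `testing` labels are hypothetical, `accuracy` is a hand-written stand-in for an accuracy-style evaluation metric, and `edit_distance` is included on the assumption that a string similarity measure such as Levenshtein distance is among the measures discussed.

```python
def accuracy(reference, test):
    """Fraction of positions where the two label sequences agree."""
    if len(reference) != len(test):
        raise ValueError("sequences must have the same length")
    matches = sum(r == t for r, t in zip(reference, test))
    return matches / len(reference)


def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


training = 'PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split()
# Hypothetical labels: the original "testing" list is truncated in the excerpt.
testing = 'PERSON OTHER OTHER OTHER OTHER OTHER'.split()

print(accuracy(training, testing))
print(edit_distance("relate", "relation"))
```

Here accuracy compares the two label sequences position by position, while edit distance compares two strings character by character; both reduce a comparison to a single number.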

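Original 01_String Processing-----04_Applying Zipf's Law to Text

# Zipf's law states that the frequency of a token in a text is inversely proportional to its rank or position in the sorted list.
# So the most frequent word occurs roughly twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word.
# The law describes how tokens are distributed in a language: some tokens occur very frequently, others occur less often, and some hardly occur at all.
# Use NLTK to obtain the log-log plot for Zipf's law
# The rank of a word in the document versus its occurrence...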

2018-08-26 12:57:45 597
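
The rank-versus-frequency relationship described in the excerpt above can be sketched without NLTK or a plotting library. This is an illustrative assumption-laden version: the `rank_frequency` helper, the regex tokenizer, and the toy `sample` sentence are all hypothetical, and the post itself plots the log-log curve with NLTK on a real corpus.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    counts = Counter(words)
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


# Toy text for illustration only; a real check needs a large corpus.
sample = "the cat and the dog and the bird saw the fox"
table = rank_frequency(sample)

for rank, word, freq in table:
    # Under Zipf's law, log(freq) plotted against log(rank) is roughly a
    # straight line with negative slope.
    print(rank, word, freq, round(math.log(rank), 3), round(math.log(freq), 3))
```

Feeding the two logarithm columns to any plotting tool gives the log-log plot the excerpt refers to.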

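Original 01_String Processing-----03_Replacing and Correcting Tokens

1.3.1 Replacing words using regular expressions
# Create the file replacers.py, which is imported by the calling code
import re
replacement_patterns = [
(r'won\'t', 'will not'),
(r'can\'t', 'cannot'),
(r'i\'m', 'i am'),
(r'ain\'t', 'is not'),
(r'(\w+)\'ll', '\g<1> wil...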

2018-08-26 12:56:47 359
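
A minimal sketch of how a pattern list like the one in the excerpt above might be applied. Only the four patterns that appear complete in the excerpt are used; the truncated one is left out rather than guessed. The class name `RegexpReplacer` and its `replace` method are hypothetical names for illustration and may differ from what replacers.py actually defines.

```python
import re

# The four patterns shown in full in the excerpt; the rest is truncated there.
replacement_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"ain't", "is not"),
]


class RegexpReplacer:
    """Apply each (pattern, replacement) pair to a text, in order."""

    def __init__(self, patterns=replacement_patterns):
        # Compile once so repeated calls to replace() stay cheap.
        self.patterns = [(re.compile(regex), repl) for regex, repl in patterns]

    def replace(self, text):
        for pattern, repl in self.patterns:
            text = pattern.sub(repl, text)
        return text


replacer = RegexpReplacer()
expanded = replacer.replace("i'm sure it won't rain, but i can't promise")
print(expanded)
```

The patterns are case-sensitive as written, so capitalized forms such as "Can't" would pass through unchanged unless the text is lowercased first or `re.IGNORECASE` is used.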

Original 01_String Processing-----02_Normalization

1.2.1 Removing punctuation
def main1():
    text = [" It is a pleasant evening.", "Guests, who came from US arrived at the venue", "Food was tasty."]
    from nltk.tokenize import word_tokenize
    tokenized_docs...

2018-08-25 09:50:04 735
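
The excerpt above stops right after tokenization, so the punctuation-removal step itself is not visible. One common way to do it, sketched here under assumptions: a regex built from `string.punctuation` strips punctuation characters from each token, and empty tokens are dropped. A plain whitespace split stands in for `word_tokenize` so that the sketch needs no NLTK install; the helper name `strip_punctuation` is hypothetical.

```python
import re
import string

text = [" It is a pleasant evening.",
        "Guests, who came from US arrived at the venue",
        "Food was tasty."]

# Character class matching any single ASCII punctuation character.
punctuation = re.compile("[%s]" % re.escape(string.punctuation))


def strip_punctuation(docs):
    """Tokenize each document and remove punctuation from every token."""
    cleaned = []
    for doc in docs:
        tokens = doc.split()  # assumption: stands in for word_tokenize
        stripped = [punctuation.sub("", token) for token in tokens]
        cleaned.append([token for token in stripped if token])  # drop empties
    return cleaned


result = strip_punctuation(text)
print(result)
```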

Original 01_String Processing------01_Tokenization

1.1.1 Splitting text into sentences
def main1():
    from nltk.tokenize import sent_tokenize
    import nltk
    text = " Welcome readers from U.S. I hope you find it interesting. Please do reply."
    print(sent_toke...

2018-08-23 23:36:05 663
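
For comparison with the `sent_tokenize` call in the excerpt above, here is a deliberately naive sentence splitter using only the standard library. It is not how NLTK works: `naive_sent_split` is a hypothetical helper that breaks after any `.`, `!` or `?` followed by whitespace.

```python
import re


def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = " Welcome readers from U.S. I hope you find it interesting. Please do reply."
sentences = naive_sent_split(text)
for sentence in sentences:
    print(sentence)
```

A rule this simple treats every period followed by a space as a sentence boundary, so an abbreviation in the middle of a sentence (for example "U.S. troops") would be split incorrectly. Handling such cases is the usual reason to prefer a trained tokenizer over a hand-written regex.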
