*: zero or more instances of the preceding item (+ and * are sometimes called closures)
^: matches the start of the string
\s: matches any whitespace character
\w: matches a word character (letter, digit, or underscore)
\W: matches any character that is not a letter, digit, or underscore
\S: the complement of \s (any non-whitespace character)
\b: a word boundary (zero width)
\d: any decimal digit
\D: any non-digit character
\t: the tab character
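A quick illustration of several of these metacharacters using the standard re module (the sample string is an arbitrary choice):

```python
import re

s = "Rate: 3 items\tdone"
digits = re.findall(r'\d+', s)   # \d+ finds runs of decimal digits
words = re.findall(r'\w+', s)    # \w+ finds runs of word characters
parts = re.split(r'\s+', s)      # \s+ splits on any whitespace, including \t
print(digits)  # ['3']
print(words)   # ['Rate', '3', 'items', 'done']
print(parts)   # ['Rate:', '3', 'items', 'done']
```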
8. Write a utility function that takes a URL as its argument and returns the contents of the URL with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g. raw_contents = urllib.urlopen('http://www.nltk.org/').read().
from urllib.request import urlopen  # urllib.urlopen in Python 2
import re

def content(url):
    # Fetch the page at the given URL (the original hard-coded the address
    # instead of using the parameter) and strip everything between '<' and '>'.
    raw_contents = urlopen(url).read().decode('utf-8')
    return re.sub(r'<[^>]*>', '', raw_contents)
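The tag-stripping regex itself can be checked offline on a small HTML fragment; the [^>]* class is what keeps a single match from greedily swallowing the text between two tags:

```python
import re

def strip_tags(html):
    # Remove anything between a '<' and the next '>'.
    return re.sub(r'<[^>]*>', '', html)

html = '<p>Natural <b>Language</b> Toolkit</p>'
print(strip_tags(html))  # Natural Language Toolkit
```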
a. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).
import nltk

def load(file):
    # Read the whole corpus file into a single string.
    with open(file) as f:
        return f.read()

content = load('corpus.txt')
pattern = r'''(?x)      # verbose flag: whitespace and comments in the pattern are ignored
    \w+                 # a word
    | [.,?!:;]          # or a single punctuation symbol
'''
nltk.regexp_tokenize(content, pattern)
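The behaviour of such a verbose pattern can be checked with re.findall alone, which is what nltk.regexp_tokenize uses internally when gaps=False (the sample sentence is an arbitrary choice):

```python
import re

pattern = r'''(?x)      # verbose flag: layout and comments are ignored
    \w+                 # a word
    | [.,?!:;]          # or a single punctuation symbol
'''
text = "Hello, world! Is this working?"
print(re.findall(pattern, text))
# ['Hello', ',', 'world', '!', 'Is', 'this', 'working', '?']
```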
b. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expressions: monetary amounts; dates; names of people and organizations.
import nltk

text = "The book is $5"
pattern = r'''(?x)              # verbose flag
    \$\d+(?:\.\d+)?             # monetary amounts such as $5 or $19.99
    | \d{1,2}/\d{1,2}/\d{2,4}   # simple numeric dates such as 3/15/2024
    | [A-Z][a-z]+               # capitalized names of people and organizations
'''
nltk.regexp_tokenize(text, pattern)
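A sketch of the same pattern run over a sentence containing all three kinds of expressions, checked with re.findall (the sentence and names are made up; note that 'On' slips through because the capitalized-word rule is deliberately naive):

```python
import re

pattern = r'''(?x)              # verbose flag
    \$\d+(?:\.\d+)?             # monetary amount
    | \d{1,2}/\d{1,2}/\d{2,4}   # simple numeric date
    | [A-Z][a-z]+               # capitalized word (crude name detector)
'''
text = "On 3/15/2024 Alice paid $19.99 to Acme"
print(re.findall(pattern, text))
# ['On', '3/15/2024', 'Alice', '$19.99', 'Acme']
```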
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
11. Define a string raw containing a sentence of your own choosing. Now, split raw on some character other than space, such as 's'.
Sorry, I don't understand the meaning of the problem, but I think it is easy to solve. So I didn't solve it.
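For completeness, the exercise is just asking for str.split with a non-default separator; a minimal sketch, with an arbitrarily chosen sentence:

```python
raw = 'this is a test sentence'
# Split on the character 's' rather than on whitespace;
# the separator itself is removed from the pieces.
print(raw.split('s'))
# ['thi', ' i', ' a te', 't ', 'entence']
```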
string = 'this is a string'
for w in string:
    print(w)    # print w in Python 2; iterates character by character
import nltk
from nltk.corpus import brown

pattern = r'''(?x)      # verbose flag
    [Ww]h[a-z]+         # wh-words such as what, Which, where
'''
text = nltk.Text(brown.words(categories='news'))   # brown.word is not a method
test = ' '.join(text[300:1000])                    # join tokens rather than str() of a list
nltk.regexp_tokenize(test, pattern)
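Since the corpus is already tokenized, the same wh-words can also be collected with a plain list comprehension, no re-tokenizing needed; a self-contained sketch in which the word list stands in for brown.words(categories='news'):

```python
# Stand-in for a tokenized corpus slice such as brown.words(categories='news')
words = ['What', 'time', 'is', 'it', 'when', 'the', 'whale', 'asked', 'Why']
wh_words = [w for w in words if w.lower().startswith('wh')]
print(wh_words)
# ['What', 'when', 'whale', 'Why']
```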
# sent is assumed here to be a list of lines, each of the form 'word number'
sents = [line.split() for line in sent[:8]]
words = [[w, int(n)] for w, n in sents]
import nltk
from urllib.request import urlopen  # urllib.urlopen in Python 2

url = 'http://www.weather.com.cn/weather/101020100.shtml'
html = urlopen(url).read().decode('utf-8')   # decode using the page's actual encoding
pattern = r'''(?x)....................'''
nltk.regexp_tokenize(html, pattern)