*: zero or more instances of the preceding item (+ and * are sometimes called closures)
^: matches the start of the string
\s: matches any whitespace character
\w: matches a word character (letter, digit, or underscore)
\W: matches any character that is not a letter, digit, or underscore
\S: the complement of \s (any non-whitespace character)
\b: a word boundary (zero width)
\d: any decimal digit
\D: any non-digit character
\t: the tab character
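A quick illustration of several of these metacharacters using the standard re module (the sample string is an arbitrary choice):

```python
import re

s = "Rate: 3 items\tdone"
digits = re.findall(r'\d+', s)   # \d+ finds runs of decimal digits
words = re.findall(r'\w+', s)    # \w+ finds runs of word characters
parts = re.split(r'\s+', s)      # \s+ splits on any whitespace, including \t
print(digits)  # ['3']
print(words)   # ['Rate', '3', 'items', 'done']
print(parts)   # ['Rate:', '3', 'items', 'done']
```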
8. Write a utility function that takes a URL as its argument and returns the contents of the URL with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g. raw_contents = urllib.urlopen('http://www.nltk.org/').read().
from urllib.request import urlopen  # urllib.urlopen in Python 2
import re

def content(url):
    # Fetch the page at the given URL (the original hard-coded the address
    # instead of using the parameter) and strip everything between '<' and '>'.
    raw_contents = urlopen(url).read().decode('utf-8')
    return re.sub(r'<[^>]*>', '', raw_contents)
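The tag-stripping regex itself can be checked offline on a small HTML fragment; the [^>]* class is what keeps a single match from greedily swallowing the text between two tags:

```python
import re

def strip_tags(html):
    # Remove anything between a '<' and the next '>'.
    return re.sub(r'<[^>]*>', '', html)

html = '<p>Natural <b>Language</b> Toolkit</p>'
print(strip_tags(html))  # Natural Language Toolkit
```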
a. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).
import nltk

def load(file):
    # Read the whole corpus file into a single string.
    with open(file) as f:
        return f.read()

content = load('corpus.txt')
pattern = r'''(?x)      # verbose flag: whitespace and comments in the pattern are ignored
    \w+                 # a word
    | [.,?!:;]          # or a single punctuation symbol
'''
nltk.regexp_tokenize(content, pattern)
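The behaviour of such a verbose pattern can be checked with re.findall alone, which is what nltk.regexp_tokenize uses internally when gaps=False (the sample sentence is an arbitrary choice):

```python
import re

pattern = r'''(?x)      # verbose flag: layout and comments are ignored
    \w+                 # a word
    | [.,?!:;]          # or a single punctuation symbol
'''
text = "Hello, world! Is this working?"
print(re.findall(pattern, text))
# ['Hello', ',', 'world', '!', 'Is', 'this', 'working', '?']
```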
b. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expressions: monetary amounts; dates; names of people and organizations.
import nltk

text = "The book is $5"
pattern = r'''(?x)              # verbose flag
    \$\d+(?:\.\d+)?             # monetary amounts such as $5 or $19.99
    | \d{1,2}/\d{1,2}/\d{2,4}   # simple numeric dates such as 3/15/2024
    | [A-Z][a-z]+               # capitalized names of people and organizations
'''
nltk.regexp_tokenize(text, pattern)
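A sketch of the same pattern run over a sentence containing all three kinds of expressions, checked with re.findall (the sentence and names are made up; note that 'On' slips through because the capitalized-word rule is deliberately naive):

```python
import re

pattern = r'''(?x)              # verbose flag
    \$\d+(?:\.\d+)?             # monetary amount
    | \d{1,2}/\d{1,2}/\d{2,4}   # simple numeric date
    | [A-Z][a-z]+               # capitalized word (crude name detector)
'''
text = "On 3/15/2024 Alice paid $19.99 to Acme"
print(re.findall(pattern, text))
# ['On', '3/15/2024', 'Alice', '$19.99', 'Acme']
```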
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
result
[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
11. Define a string raw containing a sentence of your own choosing. Now, split raw on some character other than space, such as 's'.
Sorry, I don't understand the meaning of the problem, but I think it is easy to solve. So I didn't solve it.
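For completeness, the exercise is just asking for str.split with a non-default separator; a minimal sketch, with an arbitrarily chosen sentence:

```python
raw = 'this is a test sentence'
# Split on the character 's' rather than on whitespace;
# the separator itself is removed from the pieces.
print(raw.split('s'))
# ['thi', ' i', ' a te', 't ', 'entence']
```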
string = 'this is a string'
for w in string:
    print(w)    # print w in Python 2; iterates character by character
import nltk
from nltk.corpus import brown

pattern = r'''(?x)      # verbose flag
    [Ww]h[a-z]+         # wh-words such as what, Which, where
'''
text = nltk.Text(brown.words(categories='news'))   # brown.word is not a method
test = ' '.join(text[300:1000])                    # join tokens rather than str() of a list
nltk.regexp_tokenize(test, pattern)
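Since the corpus is already tokenized, the same wh-words can also be collected with a plain list comprehension, no re-tokenizing needed; a self-contained sketch in which the word list stands in for brown.words(categories='news'):

```python
# Stand-in for a tokenized corpus slice such as brown.words(categories='news')
words = ['What', 'time', 'is', 'it', 'when', 'the', 'whale', 'asked', 'Why']
wh_words = [w for w in words if w.lower().startswith('wh')]
print(wh_words)
# ['What', 'when', 'whale', 'Why']
```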
# sent is assumed here to be a list of lines, each of the form 'word number'
sents = [line.split() for line in sent[:8]]
words = [[w, int(n)] for w, n in sents]
import nltk
from urllib.request import urlopen  # urllib.urlopen in Python 2

url = 'http://www.weather.com.cn/weather/101020100.shtml'
html = urlopen(url).read().decode('utf-8')   # decode using the page's actual encoding
pattern = r'''(?x)....................'''
nltk.regexp_tokenize(html, pattern)