Notes on Chapter 4 of Natural Language Processing with Python

qq_34505594

Article link: https://blog.csdn.net/qq_34505594/article/details/79496011


1. Assignment. Assignment passes a reference to the object.

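# Modifying one item of a list nested inside another list changes every item, because all of them refer to the same list.

>>> empty = []
>>> nested = [empty, empty, empty]
>>> nested
[[], [], []]
>>> nested[1].append('python')
>>> nested
[['python'], ['python'], ['python']]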

# Assigning a new value to one element of the list does not propagate to the other elements.

>>> nested = [[]] * 3
>>> nested[1].append('python')
>>> nested[1] = ['Monty']  # the id of nested[1] changes
>>> nested
[['python'], ['Monty'], ['python']]

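We start with a list holding three references to a single empty list object. Appending 'python' modifies that one object, so all three entries show ['python']. Next, one of the three references is overwritten with a reference to a new object, ['Monty']. This last step changes one of the three references inside the nested list, but the ['python'] object itself is not changed and is still referenced from two places in the nested list. The key is to understand the difference between modifying an object through a reference and overwriting the reference itself.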
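The difference between mutating an object and rebinding a reference can be checked with the `is` operator. The following is a minimal sketch; the name `shared` is introduced only for illustration.

```python
# Three references to one and the same empty list.
nested = [[]] * 3
assert nested[0] is nested[1] is nested[2]

# Mutating the shared object is visible through every reference.
nested[1].append('python')
shared = nested[0]

# Rebinding slot 1 replaces one reference; the shared list is untouched.
nested[1] = ['Monty']

print(nested)                  # [['python'], ['Monty'], ['python']]
print(nested[0] is nested[2])  # True: still the same object
print(nested[1] is shared)     # False: slot 1 now refers to a new list
```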

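Important: to copy the items of a list foo into a new list bar, write bar = foo[:]. This copies the object references held in the list. To copy the structure without copying any object references, use copy.deepcopy() (a deep copy).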
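As a rough illustration of the difference between the two kinds of copy, the sketch below mutates an inner list after copying. The names `shallow` and `deep` are illustrative only.

```python
import copy

foo = [['a'], ['b']]
shallow = foo[:]            # new outer list, same inner lists
deep = copy.deepcopy(foo)   # new outer list and new inner lists

foo[0].append('x')          # mutate an inner list through foo

print(shallow)  # [['a', 'x'], ['b']]  (shares inner lists with foo)
print(deep)     # [['a'], ['b']]       (fully independent)
```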

2. Sequences

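for item in set(s).difference(t):  # iterate over the elements of s that are not in t

for item in random.sample(s, len(s)):  # iterate over the elements of s in random order; random.shuffle(s) shuffles in place and returns None, so it cannot be looped over directly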
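The two iteration idioms can be tried on a small list. This is only a sketch: the sample data is made up, and a seeded `random.Random` instance is assumed so that repeated runs behave the same way.

```python
import random

s = ['the', 'cat', 'sat', 'on', 'the', 'mat']
t = ['the', 'on']

# Elements of s that do not occur in t (sorted only to get a stable order).
only_in_s = sorted(set(s).difference(t))
print(only_in_s)  # ['cat', 'mat', 'sat']

# A randomly ordered copy of s; s itself is left untouched.
rng = random.Random(0)
shuffled = rng.sample(s, len(s))
for item in shuffled:
    print(item)
```

Using `sample` keeps the original list intact, whereas `shuffle` reorders the list it is given.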

Tuples (rearranging list items with a tuple assignment)

>>> words = ['I', 'turned', 'off', 'the', 'spectroroute']
>>> words[2], words[3], words[4] = words[3], words[4], words[2]
>>> words
['I', 'turned', 'the', 'spectroroute', 'off']
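The rearrangement works because the right-hand side is packed into a tuple before anything is assigned. A minimal sketch that spells out the two steps, with `tmp` as an illustrative name:

```python
words = ['I', 'turned', 'off', 'the', 'spectroroute']

tmp = (words[3], words[4], words[2])   # the right-hand side is evaluated first
words[2], words[3], words[4] = tmp     # then unpacked into the targets

print(words)  # ['I', 'turned', 'the', 'spectroroute', 'off']
```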

3. Reading text from a file.

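import re

def get_text(file):
    """Read a file, normalize whitespace, and strip HTML markup."""
    with open(file) as f:
        text = f.read()
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    text = re.sub(r'<.*?>', ' ', text)  # non-greedy, so each tag is removed on its own
    return text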
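A quick way to try such a function is to run it on a temporary file. The sketch below assumes UTF-8 input and uses a made-up HTML fragment. A non-greedy `<.*?>` matters here: a greedy `<.*>` would swallow everything between the first `<` and the last `>`.

```python
import os
import re
import tempfile

def get_text(path):
    """Read a file, collapse whitespace, and replace each tag with a space."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    return text

# Write a small sample document to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('<p>Hello\n\n<b>world</b></p>')
    path = tmp.name

result = get_text(path)
os.remove(path)

print(result.split())  # ['Hello', 'world']
```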

4. Parameter passing

>>> def set_up(word, properties):
...     word = 'locat'
...     properties.append('noun')
...     properties = 5
...
>>> w = ' '
>>> p = []
>>> set_up(w, p)
>>> w
' '
>>> p
['noun']

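# The value of w is unchanged, while p has changed. When set_up is called, the value of w is bound to a new variable, word. Inside the function word is reassigned, but that change does not propagate back to w. The list p, by contrast, is modified in place by append, so the caller sees 'noun'; the later assignment properties = 5 only rebinds the local name.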

This parameter passing is analogous to the following sequence of assignments:

>>> w = ' '
>>> word = w
>>> word = 'locat'
>>> w
' '

5. Functions as arguments.

sent = ['take', 'care', 'of', 'the', 'sense']

def extract_property(prop):
    return [prop(word) for word in sent]

extract_property(len)

def last_letter(word):
    return word[-1]

extract_property(last_letter)  # when a function is passed as an argument, omit the parentheses

This is equivalent to extract_property(lambda w: w[-1])
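The same idea of passing a function without calling it shows up in built-ins such as `sorted` and `map`. A small sketch, where the variable names `by_last`, `by_len`, and `lengths` are illustrative:

```python
sent = ['take', 'care', 'of', 'the', 'sense']

def last_letter(word):
    return word[-1]

# The function object itself is passed; sorted() calls it once per word.
by_last = sorted(sent, key=last_letter)
by_len = sorted(sent, key=len)
lengths = list(map(len, sent))

print(by_last)  # ['take', 'care', 'the', 'sense', 'of']
print(by_len)   # ['of', 'the', 'take', 'care', 'sense']
print(lengths)  # [4, 4, 2, 3, 5]
```

Because `sorted` is stable, words with the same key keep their original relative order.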

6. Invoking the debugger to monitor a running program and to specify the line numbers at which execution should pause.

import pdb
import mymodule

pdb.run('mymodule.myfunction()')

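This presents a prompt at which you can type commands. Type help to see the full list of commands. Typing step executes the current line and stops; if the current line calls a function, it enters the function and stops at its first line. Typing next is similar, but it stops at the next line in the current function. The break command can be used to create or list breakpoints. Typing continue resumes execution until the next breakpoint is reached. Typing the name of a variable lets you inspect its value.

7. The WordNet hypernym hierarchy. Count the size of the hypernym hierarchy rooted at a given synset s: find the size of each hyponym of s, add these together, and add 1 for s itself.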

Approach 1: recursion

def size1(s):
    return 1 + sum(size1(child) for child in s.hyponyms())

Approach 2: iteration

def size2(s):
    layer = [s]
    total = 0
    while layer:
        total += len(layer)
        layer = [h for c in layer for h in c.hyponyms()]
    return total
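Both versions can be compared without downloading WordNet by running them on a tiny hand-built hierarchy. In this sketch `FakeSynset` is a hypothetical stand-in that models only the `hyponyms()` method; it is not part of NLTK.

```python
class FakeSynset:
    """Hypothetical stand-in for a synset: only hyponyms() is modelled."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def hyponyms(self):
        return self._children


def size1(s):
    # Recursive: this node plus the sizes of all sub-hierarchies.
    return 1 + sum(size1(child) for child in s.hyponyms())


def size2(s):
    # Iterative: walk the hierarchy one layer at a time.
    layer = [s]
    total = 0
    while layer:
        total += len(layer)
        layer = [h for c in layer for h in c.hyponyms()]
    return total


root = FakeSynset('animal', [
    FakeSynset('dog', [FakeSynset('puppy'), FakeSynset('hound')]),
    FakeSynset('cat'),
])

print(size1(root), size2(root))  # 5 5
```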

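8. Building a letter trie: a recursive function builds a nested dictionary structure in which each level of nesting contains all the words with a given prefix, and each sub-trie contains all the possible continuations.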

def insert(trie, key, value):
    if key:
        first, rest = key[0], key[1:]
        if first not in trie:
            trie[first] = {}
        insert(trie[first], rest, value)
    else:
        trie['value'] = value

>>> from collections import defaultdict
>>> trie = defaultdict(dict)
>>> insert(trie, 'chat', 'cat')
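Once a few words are inserted, the nested dictionaries can be walked one letter at a time to retrieve a value. The sketch below uses a plain dict as the root and adds a hypothetical `lookup` helper that is not in the original notes.

```python
def insert(trie, key, value):
    """Store value under key, creating one nested dict per letter."""
    if key:
        first, rest = key[0], key[1:]
        if first not in trie:
            trie[first] = {}
        insert(trie[first], rest, value)
    else:
        trie['value'] = value


def lookup(trie, key):
    """Hypothetical helper: return the stored value, or None if absent."""
    node = trie
    for ch in key:
        if ch not in node:
            return None
        node = node[ch]
    return node.get('value')


trie = {}
insert(trie, 'chat', 'cat')
insert(trie, 'chien', 'dog')

print(lookup(trie, 'chat'))     # cat
print(lookup(trie, 'ch'))       # None: a prefix, not a stored word
print(sorted(trie['c']['h']))   # ['a', 'i']: the possible continuations
```

The key `'value'` is longer than one character, so it cannot collide with the single-letter keys used for branching.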

9. A simple full-text retrieval system.

import re
import nltk

def raw(file):
    with open(file) as f:
        contents = f.read()
    contents = re.sub(r'<.*?>', ' ', contents)  # strip markup
    contents = re.sub(r'\s+', ' ', contents)    # collapse whitespace
    return contents

def snippet(doc, term):
    # pad both sides so the 30-character window never runs off the text
    text = ' ' * 30 + raw(doc) + ' ' * 30
    pos = text.index(term)
    return text[pos - 30:pos + 30]

print("Building Index...")
files = nltk.corpus.movie_reviews.abspaths()
idx = nltk.Index((w, f) for f in files for w in raw(f).split())

query = ' '
while query != "quit":
    query = input("query> ")  # raw_input in Python 2
    if query in idx:
        for doc in idx[query]:
            print(snippet(doc, query))
    else:
        print("Not found")
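The indexing idea can also be tried without the movie review corpus. The following is a simplified sketch under stated assumptions: the documents are two made-up in-memory strings, `collections.defaultdict` stands in for `nltk.Index`, and `search` is a hypothetical helper replacing the interactive loop.

```python
from collections import defaultdict

# Made-up documents standing in for corpus files.
docs = {
    'doc1': 'the quick brown fox',
    'doc2': 'the lazy dog',
}

# Inverted index: word -> names of the documents containing it.
index = defaultdict(list)
for name, text in docs.items():
    for word in set(text.split()):
        index[word].append(name)


def snippet(text, term, width=30):
    """Return up to `width` characters of context on each side of term."""
    padded = ' ' * width + text + ' ' * width
    pos = padded.index(term)
    return padded[pos - width:pos + width]


def search(query):
    """Hypothetical helper: (document name, snippet) for every hit."""
    return [(name, snippet(docs[name], query).strip())
            for name in sorted(index.get(query, []))]


print(search('the'))  # [('doc1', 'the quick brown fox'), ('doc2', 'the lazy dog')]
print(search('cat'))  # []
```

Looking the query up in the index avoids rescanning every document; only the matching documents are read again to build the snippets.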