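There is not much to say about the clustering algorithm itself. These notes focus on the problems I ran into while scraping web pages with BeautifulSoup.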
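Problem 1:
This one took a long time to sort out: all the Chinese text in the scraped pages came out garbled, which was very frustrating. None of the fixes I found online worked. In the end the cause turned out to be that the Eclipse environment was set to GBK while the scraped page data was UTF-8. Changing the encoding in the project properties to UTF-8 fixed it. People online say that GBK pages easily produce garbled text and that decode/encode is needed to handle them; I have not needed that yet.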
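The decode/encode approach mentioned above can be sketched as follows. This is a minimal Python 3 illustration (the notes themselves use Python 2). The byte string stands in for a downloaded GBK page, and the variable names are made up for the example.

```python
# Hypothetical stand-in for the raw bytes of a page served as GBK.
raw_gbk = '中文网页'.encode('gbk')

# Decode with the page's real encoding to get a text string...
text = raw_gbk.decode('gbk')

# ...then encode to UTF-8 when bytes are needed again (e.g. writing a file).
raw_utf8 = text.encode('utf-8')

# Decoding with the wrong codec is what produces garbled text or an error.
try:
    raw_gbk.decode('utf-8')
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False

print(raw_utf8.decode('utf-8') == text)
```

The key point is that the codec passed to decode must match the encoding the page was actually served in, not the encoding of the editor or console.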
Problem 2: BeautifulSoup offers many ways to look up text content:
from BeautifulSoup import BeautifulSoup
import re
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
soup = BeautifulSoup(''.join(doc))
print soup.prettify()
# <html>
# <head>
# <title>
# Page title
# </title>
# </head>
# <body>
# <p id="firstpara" align="center">
# This is paragraph
# <b>
# one
# </b>
# .
# </p>
# <p id="secondpara" align="blah">
# This is paragraph
# <b>
# two
# </b>
# .
# </p>
# </body>
# </html>
Ways to search the document above:
titleTag = soup.html.head.title
titleTag
# <title>Page title</title>
titleTag.string
# u'Page title'
len(soup('p'))
# 2
soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>. </p>]
soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>. </p>
soup('p', align="center")[0]['id']
# u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'
soup.find('p').b.string
# u'one'
soup('p')[1].b.string
# u'two'
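I found the .string notation unfamiliar at first. a.string is the string contained in the tag, and a['***'] is the value of the tag's *** attribute.
To get all tags of a given name, call soup('xxx') or soup.findAll("xxx"); the two are exactly equivalent and both return a list.
The list of a tag's children is a.contents.
a.attrs is a list of the tag's attributes as (name, value) tuples. It can be converted to a dictionary with dict and then indexed by the first element of each tuple: dict(a.attrs)['asd'],
but this is usually shortened to a["asd"].
For example, this code: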
for div in soup('div'):
    # Find divs whose class is 'information'
    if ('class' in dict(div.attrs) and div['class']=='information'):
        items=[a.contents[0].contents[0].lower().strip() for a in div('a')]
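To make the relationship between the attribute list and the dictionary lookup concrete, here is a minimal sketch in plain Python 3, with no BeautifulSoup involved. The list `attrs` is a hand-written stand-in for what `a.attrs` holds in the BeautifulSoup 3 version used in these notes.

```python
# Hand-written stand-in for a tag's attrs: a list of (name, value) tuples.
attrs = [('id', 'firstpara'), ('align', 'center')]

# dict() turns the tuple list into a mapping keyed by attribute name.
attr_map = dict(attrs)
print(attr_map['id'])

# Membership test, as in the 'class' check above.
print('class' in attr_map)
```

Checking membership first avoids a KeyError on tags that have no class attribute at all.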
For a detailed comparison of tuple, list, and dict, see:
http://blog.sina.com.cn/s/blog_5e3d9eb80100e5gh.html
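Problem 3: the difference between single, double, and triple double quotes in Python: single and double quotes complement each other, so a single-quoted string can contain double quotes (and vice versa) without escape characters. A triple-double-quoted string keeps its line breaks and indentation and is output as written.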
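A short sketch of the three quoting styles, written for Python 3 (the string contents are made up for illustration):

```python
single = 'He said "hi"'      # double quotes inside single quotes, no escaping
double = "It's fine"         # single quote inside double quotes, no escaping
triple = """line one
    line two (indented)"""   # line break and indentation are kept

print(single)
print(double)
print(triple)
```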
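Problem 4: printing unicode strings: when a unicode string sits inside a dict, list, or tuple, printing the container shows escape codes; once the element is indexed out and printed on its own, the correct characters appear. The str function forces a unicode string to ASCII, which in Python 2 fails with UnicodeEncodeError if the string contains non-ASCII characters.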
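The reason is that printing a container shows the repr of each element rather than the element's printed form. A minimal Python 3 sketch of the same idea follows. Python 3 no longer escapes non-ASCII characters in a repr by default, so `ascii()` is used here to imitate what Python 2 displayed.

```python
words = ['中文', 'abc']

# Converting a container to text uses repr() for each element (note the quotes).
container_view = str(words)

# ascii() escapes non-ASCII characters, similar to Python 2's list output.
escaped_view = ascii(words)

# Indexing out one element gives the characters themselves.
element_view = str(words[0])

print(escaped_view)
```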
Problem 5: the string join and split functions do opposite jobs: join connects the items of a list using a given string, and split breaks a string into a list.
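A small round trip shows the two functions undoing each other (Python 3, with a made-up list):

```python
words = ['a', 'b', 'c']

joined = ','.join(words)     # the separator string joins the list items
parts = joined.split(',')    # splitting on the same separator recovers the list

print(joined)
print(parts == words)
```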
——————————————————————————————————————————————
Introduction to Discovering Groups:
Viewing Data in Two Dimensions:
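The book designs an algorithm that represents points from a high-dimensional space approximately on a 2-D plane. The representation is only approximate; its purpose is to make the data easy to draw (with PIL), for example: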
from PIL import Image,ImageDraw
img=Image.new('RGB',(200,200),(255,255,255)) # 200x200 white background
draw=ImageDraw.Draw(img)
draw.line((20,50,150,80),fill=(255,0,0)) # Red line
draw.line((150,150,20,200),fill=(0,255,0)) # Green line
draw.text((40,80),'Hello!',(0,0,0)) # Black text
img.save('test.jpg','JPEG') # Save to test.jpg
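The algorithm in the book is:
1. random // randomly generate approximate coordinates
2. calculate // compute the distances between points under the approximate coordinates
3. compare and adjust // compare the newly computed distances with the true distances and adjust
4. repeat 2,3 until error is stable
The Python code from the book: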
import random
from math import sqrt

# pearson is the distance function defined elsewhere in the book
def scaledown(data,distance=pearson,rate=0.01):
    n=len(data)
    # The real distances between every pair of items
    realdist=[[distance(data[i],data[j]) for j in range(n)]
              for i in range(0,n)]
    outersum=0.0
    # Randomly initialize the starting points of the locations in 2D
    loc=[[random.random(),random.random()] for i in range(n)]
    fakedist=[[0.0 for j in range(n)] for i in range(n)]
    lasterror=None
    for m in range(0,1000):
        # Find projected distances
        for i in range(n):
            for j in range(n):
                fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                         for x in range(len(loc[i]))]))
        # Move points
        grad=[[0.0,0.0] for i in range(n)]
        totalerror=0
        for k in range(n):
            for j in range(n):
                if j==k: continue
                # The error is percent difference between the distances
                errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k] #BUG
                # Each point needs to be moved away from or towards the other
                # point in proportion to how much error it has
                grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm # can be read as cos * errorterm
                grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm # sin * errorterm
                # Keep track of the total error
                totalerror+=abs(errorterm)
        print totalerror
        # If the answer got worse by moving the points, we are done
        if lasterror and lasterror<totalerror: break
        lasterror=totalerror
        # Move each of the points by the learning rate times the gradient
        for k in range(n):
            loc[k][0]-=rate*grad[k][0]
            loc[k][1]-=rate*grad[k][1]
    return loc
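This code has a small flaw:
The errorterm computation does not account for two points whose real distance is 0, or two points that are very close together! A real distance of 0 raises ZeroDivisionError, and a tiny one makes the term blow up.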
errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]
The magnitude of errorterm should not be larger than that of fakedist[j][k]-realdist[j][k], so it should be changed to:
errorterm=(fakedist[j][k]-realdist[j][k])/(realdist[j][k]+1)
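Putting the fix together, a Python 3 sketch of the routine might look like this. It is not the book's code verbatim. The Euclidean `distance` default replaces the book's `pearson` (which is defined elsewhere), the `seed` and `iterations` parameters are added here only to make the example repeatable, and the skip for coinciding projected points is an extra guard that the original does not have.

```python
import random
from math import sqrt


def euclidean(a, b):
    """Assumed stand-in for the book's pearson distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scaledown(data, distance=euclidean, rate=0.01, iterations=1000, seed=0):
    rng = random.Random(seed)  # seed is an addition for repeatability
    n = len(data)
    # Real distances between every pair of items
    realdist = [[distance(data[i], data[j]) for j in range(n)]
                for i in range(n)]
    # Random starting locations in 2D
    loc = [[rng.random(), rng.random()] for _ in range(n)]
    lasterror = None
    for _ in range(iterations):
        # Projected distances in the current 2D layout
        fakedist = [[sqrt(sum((loc[i][x] - loc[j][x]) ** 2 for x in range(2)))
                     for j in range(n)] for i in range(n)]
        grad = [[0.0, 0.0] for _ in range(n)]
        totalerror = 0.0
        for k in range(n):
            for j in range(n):
                # Extra guard (assumption): skip coinciding projected points
                if j == k or fakedist[j][k] == 0:
                    continue
                # +1 in the denominator keeps this safe when realdist is 0
                errorterm = (fakedist[j][k] - realdist[j][k]) / (realdist[j][k] + 1)
                grad[k][0] += (loc[k][0] - loc[j][0]) / fakedist[j][k] * errorterm
                grad[k][1] += (loc[k][1] - loc[j][1]) / fakedist[j][k] * errorterm
                totalerror += abs(errorterm)
        # Stop once moving the points makes the total error worse
        if lasterror is not None and lasterror < totalerror:
            break
        lasterror = totalerror
        for k in range(n):
            loc[k][0] -= rate * grad[k][0]
            loc[k][1] -= rate * grad[k][1]
    return loc


# Two identical rows: the original formula would divide by zero here.
points = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [4.0, 0.0, 1.0]]
coords = scaledown(points)
print(len(coords))
```

Note that adding 1 changes the error from a relative difference to a damped one, so the layout may differ slightly from what the unmodified formula would give on data without zero distances.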