Programming Collective Intelligence (3): Discovering Groups: PyDev + Eclipse + BeautifulSoup + PIL

zjmwqx

Category: Numerical Analysis. Tags: eclipse, distance, list, python, class, import

Article link: https://blog.csdn.net/zjmwqx/article/details/7019309

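There is not much to say about the clustering algorithms themselves. This post focuses on the problems I ran into while scraping web pages with BeautifulSoup.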

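Problem 1:

This one took a long time to sort out. The Chinese text in the scraped pages came out garbled, which was very frustrating, and none of the fixes I found online helped. It turned out that the Eclipse environment was set to GBK while the scraped page data was UTF-8. Changing the text encoding in the project properties to UTF-8 solved it. People online say that GBK pages easily produce garbled text and have to be handled with decode and encode; I have not needed that so far.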
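For the GBK case mentioned above, here is a minimal sketch of the decode/encode approach. It is not taken from my project: the sample string and the variable names are made up for illustration, and it uses only the standard library (written for Python 3).

```python
# Two Chinese characters, written as escapes so the file itself stays ASCII
text = u'\u805a\u7c7b'

gbk_bytes = text.encode('gbk')      # what a GBK-encoded page would send
utf8_bytes = text.encode('utf-8')   # what a UTF-8 page would send

# Decoding with the wrong codec produces garbage (replacement characters here)
wrong = gbk_bytes.decode('utf-8', 'replace')

# Decode with the codec the page really uses, then re-encode as UTF-8
right = gbk_bytes.decode('gbk')
reencoded = right.encode('utf-8')

print(right == text)            # the text survives the round trip
print(reencoded == utf8_bytes)  # and is now valid UTF-8
print(wrong == text)            # the wrong codec does not recover it
```

The point is simply that the bytes have to be decoded with the encoding the page was written in before anything else is done with them.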

Problem 2: BeautifulSoup offers many ways to find content in a document:

from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

Ways to query the document above:

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>

titleTag.string
# u'Page title'

len(soup('p'))
# 2

soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>. </p>]

soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>. </p>

soup('p', align="center")[0]['id']
# u'firstpara'

soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'

soup.find('p').b.string
# u'one'

soup('p')[1].b.string
# u'two'

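The .string notation took me some getting used to. a.string is the string contained in a tag, and a['***'] is the value of the tag's *** attribute.

To get every tag with a given name, call soup('xxx') or soup.findAll('xxx'). The two are exactly equivalent, and both return a list.

The list of a tag's child nodes is a.contents.

a.attrs is a list of (name, value) tuples for the tag's attributes. It can be converted to a dictionary with dict and then indexed by the first element of each tuple: dict(a.attrs)['asd'], although this is usually shortened to a["asd"].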

For example, this code:

for div in soup('div'):
    # Find div tags whose class is 'information'
    if ('class' in dict(div.attrs) and div['class']=='information'):
        # Lower-cased, stripped text of every link inside the div
        items=[a.contents[0].contents[0].lower().strip() for a in div('a')]
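The list of (name, value) tuples described above is not specific to BeautifulSoup. As a rough illustration, the sketch below does a similar job with Python 3's standard html.parser module, which also hands attributes over as a list of tuples. The class name InfoLinkCollector and the sample page are hypothetical, and this is a stand-in for the snippet above, not a replacement for it.

```python
from html.parser import HTMLParser


class InfoLinkCollector(HTMLParser):
    """Collect the text of links found inside <div class="information">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.div_stack = []   # one flag per open <div>: is its class 'information'?
        self.in_link = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; dict() makes it indexable by name
        attr_dict = dict(attrs)
        if tag == 'div':
            self.div_stack.append(attr_dict.get('class') == 'information')
        elif tag == 'a' and any(self.div_stack):
            self.in_link = True
            self.items.append('')

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_stack:
            self.div_stack.pop()
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        # Accumulate text only while inside a link of interest
        if self.in_link:
            self.items[-1] += data


# Made-up sample page for illustration
page = ('<div class="information"><a href="#"><b> Foo </b></a>'
        '<a href="#"><b>BAR</b></a></div>'
        '<div class="other"><a href="#">skip</a></div>')

collector = InfoLinkCollector()
collector.feed(page)
collector.close()

items = [text.lower().strip() for text in collector.items]
print(items)
```

Using attr_dict.get('class') plays the same role as the 'class' in dict(div.attrs) check: it avoids an error for tags that have no class attribute at all.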

For a detailed comparison of tuple, list, and dict, see:

http://blog.sina.com.cn/s/blog_5e3d9eb80100e5gh.html

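Problem 3: the difference between single quotes, double quotes, and triple double quotes in Python. Single and double quotes complement each other: a single-quoted string can contain double quotes without any escape character, and the reverse also holds. Inside triple double quotes all formatting, line breaks included, is preserved and output exactly as written.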
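A small sketch of the three quoting styles; the variable names and sample strings are only for illustration.

```python
single = 'She said "hello"'      # double quotes inside single quotes, no escaping
double = "She said \"hello\""    # the same text needs escapes inside double quotes
mixed = "It's fine"              # a single quote inside double quotes, no escaping

# Triple double quotes keep line breaks and indentation exactly as typed
block = """first line
    second line, indentation kept"""

print(single == double)
print(repr(block))
```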

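Problem 4: printing unicode strings. When a unicode string sits inside a dict, list, or tuple and the container is printed, what appears is the escaped code rather than the characters. Once you index down to the string itself and print it, the correct characters appear. The str function converts a unicode string to an ASCII byte string, so it only works when every character is ASCII.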
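The behaviour described above is Python 2's: a container is displayed using the repr of its elements, and repr escapes non-ASCII characters. As a hedged illustration that also runs on Python 3, the sketch below uses the built-in ascii(), which escapes non-ASCII characters in a similar way. The sample list is made up.

```python
# Two Chinese characters written as escapes, plus a plain ASCII word
words = [u'\u805a\u7c7b', u'group']

# ascii() escapes non-ASCII characters, similar to what Python 2 showed
# when a container holding unicode strings was printed
escaped = ascii(words)
print(escaped)

# Indexing down to the element gives the real two-character string
element = words[0]
print(len(element))

# Forcing the text into ASCII fails for non-ASCII characters ...
try:
    element.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)

# ... while an explicit UTF-8 encode works and round-trips
utf8_bytes = element.encode('utf-8')
print(utf8_bytes.decode('utf-8') == element)
```

When the text may contain non-ASCII characters, an explicit encode('utf-8') is the safer choice than relying on an implicit ASCII conversion.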

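Problem 5: the string methods join and split do opposite jobs. join concatenates the items of a list into one string with a given separator between them, and split breaks a string into a list.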
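A quick round trip shows the two methods undoing each other; the word list is just sample data.

```python
words = ['pydev', 'eclipse', 'PIL']

# join: the separator string glues the list items together
line = ', '.join(words)
print(line)

# split: the same separator cuts the string back into a list
parts = line.split(', ')
print(parts == words)
```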

——————————————————————————————————————————————

An introduction to Discovering Groups:

Viewing Data in Two Dimensions:

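The book designs an algorithm that represents points from a high-dimensional space on a 2-D plane. The representation is only approximate; its purpose is to make the data easy to draw (with PIL), for example: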

from PIL import Image,ImageDraw

img=Image.new('RGB',(200,200),(255,255,255)) # 200x200 white background
draw=ImageDraw.Draw(img)
draw.line((20,50,150,80),fill=(255,0,0)) # Red line
draw.line((150,150,20,200),fill=(0,255,0)) # Green line
draw.text((40,80),'Hello!',(0,0,0)) # Black text
img.save('test.jpg','JPEG') # Save to test.jpg

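The algorithm in the book is:

1. random: randomly generate approximate coordinates

2. calculate: compute the distance between every pair of points under the approximate coordinates

3. compare and adjust: compare the newly computed distances with the true distances and adjust the coordinates

4. repeat steps 2 and 3 until the error is stable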

The Python code from the book:

import random
from math import sqrt

# pearson is the distance function defined earlier in the book's chapter code
def scaledown(data,distance=pearson,rate=0.01):
    n=len(data)
    # The real distances between every pair of items
    realdist=[[distance(data[i],data[j]) for j in range(n)]
              for i in range(0,n)]
    outersum=0.0

    # Randomly initialize the starting points of the locations in 2D
    loc=[[random.random(),random.random()] for i in range(n)]
    fakedist=[[0.0 for j in range(n)] for i in range(n)]
    lasterror=None
    for m in range(0,1000):
        # Find projected distances
        for i in range(n):
            for j in range(n):
                fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                         for x in range(len(loc[i]))]))
        # Move points
        grad=[[0.0,0.0] for i in range(n)]
        totalerror=0
        for k in range(n):
            for j in range(n):
                if j==k: continue
                # The error is percent difference between the distances
                # BUG (flagged in my notes; presumably this divides by zero
                # when realdist[j][k] is 0)
                errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]
                # Each point needs to be moved away from or towards the other
                # point in proportion to how much error it has
                grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm # can be read as cos * errorterm
                grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm # sin * errorterm
                # Keep track of the total error
                totalerror+=abs(errorterm)
        print(totalerror)
        # If the answer got worse by moving the points, we are done
        if lasterror and lasterror<totalerror: break
        lasterror=totalerror
        # Move each of the points by the learning rate times the gradient
        for k in range(n):
            loc[k][0]-=rate*grad[k][0]
            loc[k][1]-=rate*grad[k][1]
    return loc
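The update inside the loop can be written compactly. This is only my reading of the code above and the symbols are my own notation, not the book's: $D_{jk}$ stands for realdist[j][k], $d_{jk}$ for fakedist[j][k], $x_k$ for the 2-D position loc[k], and $\eta$ for rate.

```latex
e_{jk} = \frac{d_{jk} - D_{jk}}{D_{jk}}, \qquad
g_k = \sum_{j \neq k} \frac{x_k - x_j}{d_{jk}} \, e_{jk}, \qquad
x_k \leftarrow x_k - \eta \, g_k, \qquad
E = \sum_{k} \sum_{j \neq k} \lvert e_{jk} \rvert
```

The factor $(x_k - x_j)/d_{jk}$ is the unit vector pointing from point $j$ to point $k$, so its two components are the cos and sin noted in the code comments. When two points are too far apart in the plane ($e_{jk} > 0$), point $k$ moves towards point $j$; when they are too close, it moves away. The loop stops as soon as the total error $E$ is larger than in the previous pass.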