I wrote a small script to scrape the table content from a web page.
The code is as follows:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Python 2 script: download paginated listing pages and extract their table cells.
import sys, urllib
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf-8")  # allow writing non-ASCII strings without explicit encoding

def parse_page(filename):
    """Extract every <td> from a saved page and append it to 'result' as a comma-separated row."""
    f = open(filename, 'r')
    g = open('result', 'a+')
    html = f.read()
    soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
    for row in soup.find_all('tr'):
        for cell in row.find_all('td'):
            g.write(u'%s,' % cell.string)
        g.write('\n')  # one output line per table row
    g.close()
    f.close()

baseurl = "https://www.touzi.com/simu/"
for page in range(1, 75):  # pages 1..74
    url = "company-cid-3-g1-h1-i2-p%d.html" % page
    final_url = baseurl + url
    print final_url
    print "start download... %s" % url
    wp = urllib.urlopen(final_url)
    content = wp.read()
    wp.close()
    f = open(url, 'w')  # save the raw page locally, then parse the local copy
    f.write(content)
    f.close()
    parse_page(url)
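For reference, the same table extraction can be done in Python 3 with only the standard library's html.parser, avoiding the bs4 dependency. This is just my own sketch, not part of the original script, and the sample HTML at the bottom is invented for illustration:

```python
# A minimal sketch: collect the text of each <td>, grouped into rows by <tr>,
# using only the standard library. The sample HTML below is made up for demo.
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows
        self.current_row = None  # cells of the row currently being parsed
        self.in_td = False
        self.cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.current_row = []
        elif tag == 'td' and self.current_row is not None:
            self.in_td = True
            self.cell_text = []

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_td:
            self.current_row.append(''.join(self.cell_text).strip())
            self.in_td = False
        elif tag == 'tr' and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None

    def handle_data(self, data):
        if self.in_td:
            self.cell_text.append(data)

def extract_rows(html):
    """Return a list of rows, each row a list of <td> text strings."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows

sample = ("<table><tr><td>ACME Fund</td><td>2013</td></tr>"
          "<tr><td>Beta Capital</td><td>2011</td></tr></table>")
print(extract_rows(sample))
```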
The processed data is saved to a file named result, which can then be imported into Excel for further processing. This basically does the job, though it is not very general-purpose.
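One concrete weakness of the comma-joined output: a cell whose text itself contains a comma will shift the columns when the file is imported into Excel. The standard csv module quotes such cells automatically. A short sketch with invented row data:

```python
# Sketch (made-up data): write rows with Python's csv module so that cells
# containing commas or quotes are quoted correctly for Excel import.
import csv
import io

rows = [
    ["ACME Fund", "Beijing, Haidian", "2013"],  # note the comma inside a cell
    ["Beta Capital", "Shanghai", "2011"],
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows(rows)
output = buf.getvalue()
print(output)
```

In the real script this would mean replacing the manual g.write(u'%s,' % ...) calls with a csv.writer over the open result file.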