A first pass at web scraping
What it does
Reading a web page
conn = urllib.urlopen('page URL').read()
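The call above is Python 2. For readers on Python 3, a minimal sketch of the same step might look like the following. The fetch helper, its utf-8 default and the timeout value are assumptions made for illustration and are not part of the original script; the real encoding of the target page would need to be checked.

```python
from urllib.request import urlopen


def fetch(url, encoding="utf-8", timeout=10):
    """Download a page and return it as text (hypothetical helper)."""
    # In Python 3, read() returns bytes, so decode before using str.find.
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
    # errors="replace" avoids a crash if the assumed encoding is wrong.
    return raw.decode(encoding, errors="replace")
```

Under these assumptions, conn = fetch('page URL') would play the role of the read() call above.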
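Finding the target content in the page the URL points to
find() returns the position of artititle"> as an integer index (or -1 if the marker is not found). The 17 and 10 in titlec move the start of the slice 17 characters forward and the end 10 characters back. The marker itself is 11 characters long, so +17 skips the marker plus 6 more characters.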
title_start = conn.find(r'artititle">')
title_end = conn.find(r'</',title_start)
titlec = conn[title_start+17:title_end-10]
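The hard-coded offsets depend on the exact whitespace in the page. As a rough sketch of the same find-and-slice idea without magic numbers, the helper below skips the marker by its own length and strips surrounding whitespace. The name between and the sample markup are hypothetical and only loosely modelled on the page described here.

```python
def between(text, start_marker, end_marker, pos=0):
    """Return (value, end_index) for the text between two markers.

    Returns (None, -1) when either marker is missing, so callers never
    slice with the -1 that str.find uses for "not found".
    """
    start = text.find(start_marker, pos)
    if start == -1:
        return None, -1
    start += len(start_marker)  # skip past the marker itself
    end = text.find(end_marker, start)
    if end == -1:
        return None, -1
    return text[start:end].strip(), end


# Hypothetical markup, not copied from the real site.
sample = '<div class="artititle">\n  Campus talk\n</div><!-- x -->ACME<!-- y -->'
title, after_title = between(sample, 'artititle">', '</')
company, _ = between(sample, '-->', '<!--', after_title)
print(title)    # Campus talk
print(company)  # ACME
```

Returning the end index lets each lookup continue from where the previous one stopped, which is how the chained find calls in this post work.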
Source code
A first version that scrapes the Hunan University careers site (to be improved later, using regular expressions)
# Python 2 code: it relies on urllib.urlopen and the print statement
import urllib

url = [''] * 60   # article links, 15 per list page
info = [''] * 60  # formatted record for each article
con = [''] * 4    # raw HTML of the 4 list pages
con[0] = urllib.urlopen('http://scc.hnu.edu.cn/newsjob!getMore.action?p.currentPage=1&Lb=1').read()
con[1] = urllib.urlopen('http://scc.hnu.edu.cn/newsjob!getMore.action?p.currentPage=2&Lb=1').read()
con[2] = urllib.urlopen('http://scc.hnu.edu.cn/newsjob!getMore.action?p.currentPage=3&Lb=1').read()
con[3] = urllib.urlopen('http://scc.hnu.edu.cn/newsjob!getMore.action?p.currentPage=4&Lb=1').read()
# Article function: fetch article j and extract its fields
def article(j):
    conn = urllib.urlopen(url[j]).read()
    # Article title
    title_start = conn.find(r'artititle">')
    title_end = conn.find(r'</', title_start)
    titlec = conn[title_start+17:title_end-10]
    # Organization name
    danwei_start = conn.find(r'-->', title_end)
    danwei_end = conn.find(r'<!--', danwei_start)
    danwei = conn[danwei_start+3:danwei_end]
    # Recruitment venue
    addr_start1 = conn.find(r'height25">', danwei_end)
    addr_start2 = conn.find(r'<td>', addr_start1)
    addr_end = conn.find(r'</', addr_start2)
    addr = conn[addr_start2+29:addr_end-100]
    # Start time
    time_start_a = conn.find(r'height25">', addr_end)
    time_start_b = conn.find(r'<td>', time_start_a)
    time_start_c = conn.find(r'</', time_start_b)
    time_start = conn[time_start_b+30:time_start_c-15]
    # End time
    time_end_a = conn.find(r'height25">', time_start_c)
    time_end_b = conn.find(r'<td>', time_end_a)
    time_end_c = conn.find(r'</', time_end_b)
    time_end = conn[time_end_b+29:time_end_c-15]
    info[j] = 'danwei:'+danwei+'\naddress:'+addr+'\nstart-time:'+time_start+'\nend-time:'+time_end
    print info[j]
# Link function: collect the article links found on list page p
def link(p):
    i = p * 15
    title = con[p].find(r'<a title=')
    href = con[p].find(r'href=', title)
    html = con[p].find(r'target', href)
    t = (p + 1) * 15
    while i < t and title != -1 and href != -1:
        # href + 6 skips href=" and html - 3 stops before the closing quote
        url[i] = 'http://scc.hnu.edu.cn/' + con[p][href + 6:html - 3]
        title = con[p].find(r'<a title=', html)
        href = con[p].find(r'href=', title)
        html = con[p].find(r'target', href)
        i = i + 1
# Page function
# Main routine
for u in range(4):
    link(u)
for j in range(60):
    if url[j]:  # skip slots that link() did not fill
        article(j)
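The post names regular expressions as the planned improvement. As a hedged sketch of what the link-collecting step could look like with the re module, the pattern below assumes each entry has the shape <a title="..." href="..." target=...>, which is inferred from the find calls above and not verified against the live site. The helper name extract_links and the sample paths are hypothetical.

```python
import re

# Assumed attribute order: title, then href, then target.
LINK_RE = re.compile(r'<a\s+title="([^"]*)"\s+href="([^"]*)"\s+target=')


def extract_links(page, base="http://scc.hnu.edu.cn/"):
    """Return (title, absolute_url) pairs for every matching anchor."""
    return [(title, base + href) for title, href in LINK_RE.findall(page)]


# Hypothetical list-page fragment with made-up paths.
sample_page = (
    '<a title="Talk A" href="article-1.html" target="_blank">Talk A</a>\n'
    '<a title="Talk B" href="article-2.html" target="_blank">Talk B</a>'
)
for title, link_url in extract_links(sample_page):
    print(title, link_url)
```

Compared with chained find calls, the pattern states the expected markup in one place and needs no character offsets, though it is just as sensitive to changes in the page layout.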
Result
The script is set up to display 60 records.
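The figure of 60 follows from the code: 4 list pages with 15 link slots each. A small sketch of that arithmetic, with the list-page URLs generated in a loop, is given here. The constant and function names are illustrative and do not appear in the original script.

```python
PAGE_URL = "http://scc.hnu.edu.cn/newsjob!getMore.action?p.currentPage={page}&Lb=1"
PAGES = 4
LINKS_PER_PAGE = 15


def page_urls(pages=PAGES):
    """Build the list-page URLs for pages 1..pages."""
    return [PAGE_URL.format(page=n) for n in range(1, pages + 1)]


def slots(p, per_page=LINKS_PER_PAGE):
    """Indices of the url/info lists that list page p (0-based) fills."""
    return range(p * per_page, (p + 1) * per_page)


print(len(page_urls()) * LINKS_PER_PAGE)  # 60
print(list(slots(1))[:3])                 # [15, 16, 17]
```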
danwei:蓝网科技有限公司
address:东风多功能报告厅
start-time:2015-06-26 09:00
end-time:2015-06-26 11:30
danwei:湖南大学毕业生就业指导中心
address: 化学化工学院B106会议室
start-time:2015-06-25 10:00
end-time:2015-06-25 11:30
danwei:中建三局集团有限公司
address:东风多功能报告厅
start-time:2015-06-24 15:00
end-time:2015-06-24 17:30
danwei:北京华图宏阳教育文化发展股份有限公司长沙分公司
address:东风多功能报告厅
start-time:2015-06-23 19:00
end-time:2015-06-23 21:00
danwei:湖南大学毕业生就业指导中心
address: 无
start-time:2015-06-21 08:00
end-time:2015-06-21 21:00
danwei:爱唯尔(上海)企业发展有限公司
address:东风多功能报告厅
start-time:2015-06-18 09:00
end-time:2015-06-18 11:30
danwei:中建三局第二建设工程有限责任公司
address:东风多功能报告厅
start-time:2015-06-17 15:00
end-time:2015-06-17 17:30
danwei:长沙埃索凯化工有限公司
address:东风多功能报告厅
start-time:2015-06-16 19:00
end-time:2015-06-16 21:30
danwei:旭辉集团长沙事业部
address: 复临舍201
start-time:2015-06-16 19:00
end-time:2015-06-16 21:00