Web Crawler 2 (Liu Luping)

llp8888

Last modified: 2022-03-20 12:03:01

Tags: python

First published: 2022-03-20 11:53:34

Article link: https://blog.csdn.net/weixin_61630482/article/details/123610299

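1. Determine the URL and fetch the page source
# Python 2 (urllib2)
import urllib
import urllib2

page = 1
# 'page URL' is a placeholder for the base URL of the listing page,
# e.g. 'http://www.qiushibaike.com/hot/page/'
url = 'page URL' + str(page)
try:
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError as e:
    # HTTPError carries a status code; a plain URLError only has a reason
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason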
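
The code above targets Python 2. For readers on Python 3, where `urllib2` was folded into `urllib.request` and `urllib.error`, the same fetch-and-report-errors step could be sketched roughly as follows. The helper names `build_page_url` and `fetch`, and the `timeout` parameter, are illustrative assumptions and do not come from the original post.

```python
import urllib.error
import urllib.request


def build_page_url(base_url, page):
    """Append the page number to the base URL, as the original code does."""
    return base_url + str(page)


def fetch(url, timeout=10):
    """Return the raw response body as bytes, or None if the request fails.

    Hypothetical helper: the name and the timeout default are assumptions.
    """
    try:
        request = urllib.request.Request(url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status code
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # The request never got a valid answer (DNS failure, refused, ...)
        print("URL error:", e.reason)
    return None
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first; this replaces the `hasattr` checks in the Python 2 version.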

2. Add headers
# Python 2 (urllib2)
import urllib
import urllib2

page = 1
# 'page URL' is a placeholder for the base URL of the listing page
url = 'page URL' + str(page)
# Send a browser-like User-Agent so the request is not rejected as a script
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    print response.read()
except urllib2.URLError as e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
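
As a rough Python 3 counterpart of the header step, the sketch below only builds the request object, so it can be inspected without touching the network. `make_request` is a hypothetical helper name, and `http://example.com/page/1` is a stand-in URL, not the site the post crawls.

```python
import urllib.request

# Same User-Agent string as in the original post
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def make_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a browser-like User-Agent header."""
    headers = {"User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = make_request("http://example.com/page/1")  # stand-in URL
    # Request stores header names capitalized, hence "User-agent"
    print(req.get_header("User-agent"))
```

The resulting object can then be passed to `urllib.request.urlopen` in place of a bare URL.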

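3. Extract all the jokes on a page

import re

# response comes from step 2; read() can only be consumed once, so do not print it first
content = response.read().decode('utf-8')

# Groups: author, joke text, markup between the text and the stats block, number shown in stats
pattern = re.compile('<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?' +
                     'content">(.*?)</div>(.*?)<div class="stats.*?class="number">(.*?)</i>', re.S)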
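
To see what such a pattern captures, here is a small Python 3 sketch that applies it with `re.findall` to a hand-written HTML fragment. `SAMPLE_HTML` is invented for illustration and only imitates the structure the pattern expects; the real page markup may differ. `extract_items` and the dictionary keys are likewise hypothetical names.

```python
import re

# Four groups: author, joke text, markup between text and stats, stats number.
# re.S lets "." match newlines, so one item can span several lines.
PATTERN = re.compile(
    r'<div.*?author">.*?<a.*?<img.*?>(.*?)</a>.*?<div.*?'
    r'content">(.*?)</div>(.*?)<div class="stats.*?class="number">(.*?)</i>',
    re.S,
)

# Hypothetical markup, written only to exercise the pattern
SAMPLE_HTML = """
<div class="author">
<a href="/users/1"><img src="a.png"/>Alice</a>
</div>
<div class="content">
First joke
</div>
<div class="stats"><span><i class="number">42</i> likes</span></div>
<div class="author">
<a href="/users/2"><img src="b.png"/>Bob</a>
</div>
<div class="content">
Second joke
</div>
<div class="stats"><span><i class="number">7</i> likes</span></div>
"""


def extract_items(html):
    """Return one dict per matched item, with surrounding whitespace stripped."""
    items = []
    for author, text, extra, number in PATTERN.findall(html):
        items.append({
            "author": author.strip(),
            "content": text.strip(),
            "extra": extra.strip(),
            "number": number.strip(),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item["author"], item["number"], item["content"])
```

Because every quantifier is lazy (`.*?`), each match stops at the nearest closing tag, which keeps one item from swallowing the next.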
