Python Crawler (Part 4): Getting Around Site Restrictions So You Can Fetch What You Want

qq_23168063

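Websites often restrict non-human visitors such as crawlers: they block the crawler and answer with a 403 Forbidden error. This post covers how to get around that.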

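To load faster and save bandwidth, many sites send pages Gzip-compressed, so the response has to be decompressed before it can be read.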
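As a minimal sketch of the Gzip point, the snippet below uses Python 3 (where `urllib2` became `urllib.request`) and the standard `gzip` module. The helper name `decode_body` is made up for illustration, and the demonstration runs offline on a page compressed locally rather than on a real response.

```python
import gzip


def decode_body(raw, content_encoding):
    """Return the decompressed body if the server sent it gzip-compressed."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.decompress(raw)
    # Not compressed (or unknown encoding): hand the bytes back unchanged.
    return raw


# Offline demonstration: compress a small page, then decode it again.
page = "<html><body>hello</body></html>".encode("utf-8")
compressed = gzip.compress(page)
restored = decode_body(compressed, "gzip")
untouched = decode_body(page, None)
```

With a real response object from `urllib.request.urlopen`, `raw` would be `response.read()` and `content_encoding` would be `response.headers.get("Content-Encoding")`.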

Inconsistent character encodings, and exception handling.
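One possible way to cope with inconsistent encodings is to try a short list of candidates and fall back to a lossy decode. The sketch below is Python 3; the function name `decode_text` and the candidate list (`utf-8`, then `gbk`, which is common on Chinese sites) are assumptions for illustration, not something the original code does.

```python
def decode_text(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in turn and return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and replace bad bytes.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


# The same two characters encoded two different ways.
utf8_bytes = "京东".encode("utf-8")
gbk_bytes = "京东".encode("gbk")
```

Trying `utf-8` first is deliberate: it rejects most byte sequences that are not UTF-8, whereas `gbk` accepts many byte pairs and would silently produce wrong characters.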

https://www.jd.com/robots.txt


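User-agent: *                         # * matches every spider
Disallow: /?*                         # ?* matches every dynamic page, so no agent may crawl dynamic pages
Disallow: /pop/*.html                 # do not crawl the pages under /pop/
Disallow: /pinpai/*.html?*            # do not crawl the brand (pinpai) pages
User-agent: EtaoSpider                # Etao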
Disallow: / 
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /
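Rules like these can be checked programmatically. The sketch below uses Python 3's standard `urllib.robotparser` on a subset of the rules above, parsed from a string so that it runs offline. The agent name `MyCrawler` is hypothetical. As far as I know the standard-library parser matches paths by simple prefix and does not expand `*` wildcards the way search engines do, so only the wildcard-free cases are checked here.

```python
from urllib.robotparser import RobotFileParser

# A subset of the rules listed above, without the annotations.
rules = """\
User-agent: *
Disallow: /pop/*.html
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# EtaoSpider and HuihuiSpider are banned from the whole site.
etao_allowed = parser.can_fetch("EtaoSpider", "https://www.jd.com/")
huihui_allowed = parser.can_fetch("HuihuiSpider", "https://www.jd.com/")
# Any other agent falls under the "*" group.
other_allowed = parser.can_fetch("MyCrawler", "https://www.jd.com/index.html")
```

To read the live file instead, `parser.set_url("https://www.jd.com/robots.txt")` followed by `parser.read()` would replace the `parse(rules)` call.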

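The urllib2 standard library module:

It is a more advanced page-fetching module with a large number of methods.

We also use the choice function from the random module.

The goal is to imitate a real user visiting the page in a browser, using proxy IPs and forged request headers.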
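The proxy part is not shown in the code that follows, so here is a hedged sketch of how it might look in Python 3, where `urllib2` became `urllib.request`. The proxy address and the helper name `build_proxy_opener` are placeholders invented for illustration. Nothing is sent over the network until `opener.open(...)` is called.

```python
import urllib.request

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
PROXY = {"http": "http://127.0.0.1:8080"}
FIREFOX_UA = "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"


def build_proxy_opener(proxies, user_agent):
    """Build an opener that routes requests through `proxies` and sends `user_agent`."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    # addheaders replaces the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = build_proxy_opener(PROXY, FIREFOX_UA)
# opener.open("http://blog.csdn.net/wzy0623") would now go through the proxy.
```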

import urllib2

url = "http://blog.csdn.net/wzy0623"
#html = urllib.urlopen(url)
req = urllib2.Request(url)
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.3; … Gecko/20100101 Firefox/54.0")
# GET is the request method, not a header, so it is not added here.
req.add_header("Host","blog.csdn.net")
req.add_header("Referer","http://blog.csdn.net/")

html = urllib2.urlopen(req)

print html.read()

Using key-value pairs (a dict):

# GET is the request method, not a header, so it does not belong in this dict.
my_headers = {
	"User-Agent":"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
	"Host":"blog.csdn.net",
	"Referer":"http://blog.csdn.net/"
	}


req = urllib2.Request(url,headers=my_headers)

html = urllib2.urlopen(req)

print html.read()

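It is important to understand the header information that our own request sends (the request headers, not the server's response headers).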
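To see what a request will actually send, it can be inspected before it is opened. This is a small Python 3 sketch (`urllib.request` is the successor to `urllib2`) that reuses the URL and header values from above and builds the request without touching the network.

```python
import urllib.request

url = "http://blog.csdn.net/wzy0623"
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
    "Referer": "http://blog.csdn.net/",
})

# The method is a property of the request, not a header.
method = req.get_method()
# The host is taken from the URL; urllib sends it as the Host header.
host = req.host
# urllib stores header names via str.capitalize(), hence "User-agent".
user_agent = req.get_header("User-agent")
all_headers = dict(req.header_items())
```

Comparing `all_headers` with the request headers shown in the browser's developer tools is a quick way to spot what the crawler is missing.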

Code reuse, encapsulation, and exception handling

import urllib2
import random

url = "http://blog.csdn.net/wzy0623"


my_headers =[
	"Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0"
]
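# Add more User-Agent strings to the list above: the requests then look as if
# they come from several different users, which makes a ban less likely.
def get_content(url,headers):
	"""
	Fetch a page that would otherwise return 403 Forbidden.
	"""
	random_header=random.choice(headers)
	req = urllib2.Request(url)
	req.add_header("User-Agent",random_header)
	req.add_header("Host","blog.csdn.net")
	# GET is the request method, not a header, so it is not added here.
	req.add_header("Referer","http://blog.csdn.net/")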

	content = urllib2.urlopen(req).read()
	return content

print get_content(url,my_headers)
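The exception-handling side could be sketched as follows. This is a Python 3 variant (`urllib.request` and `urllib.error` in place of `urllib2`), not the code above. The function name `fetch`, its `(body, error)` return shape, and the `opener` parameter are all assumptions made so that the error paths can be exercised without a network connection.

```python
import random
import urllib.error
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
]


def fetch(url, user_agents, opener=urllib.request.urlopen, timeout=10):
    """Return (body, error): body is bytes on success, error is a message on failure."""
    req = urllib.request.Request(url, headers={
        "User-Agent": random.choice(user_agents),
        "Referer": "http://blog.csdn.net/",
    })
    try:
        return opener(req, timeout=timeout).read(), None
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 403.
        return None, "HTTP %d" % exc.code
    except urllib.error.URLError as exc:
        # The server could not be reached (DNS failure, refused connection, ...).
        return None, "unreachable: %s" % exc.reason
```

`HTTPError` is caught first because it is a subclass of `URLError`. A call such as `fetch("http://blog.csdn.net/wzy0623", USER_AGENTS)` would perform the real request.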
