Where to Find User-Agent Strings + Setting a Proxy in urllib and requests

雨神_Forrest

Article link: https://blog.csdn.net/Crazy_005/article/details/108511698

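1. Where to find User-Agent strings

Search for "mobile browser User-Agent" to find strings like the one below (this particular example is a desktop Firefox 3.0.7 string):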
'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

Mobile browser User-Agent strings
https://blog.csdn.net/ccclll1990/article/details/17006159

Commonly used browser User-Agent request headers
https://blog.csdn.net/mouday/article/details/80182397
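
Once a User-Agent string has been picked, it still has to be attached to the request. The following is a minimal sketch using only the standard library; the helper name `make_request` is hypothetical, and the target URL is simply the one used elsewhere in this post.

```python
import urllib.request

# The User-Agent string quoted above.
USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) "
    "Gecko/2009021910 Firefox/3.0.7"
)


def make_request(url, user_agent=USER_AGENT):
    """Build a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = make_request("http://www.baidu.com")
# Passing the request to urllib.request.urlopen(request) would send it
# with the header set above instead of the default Python-urllib one.
```

Note that urllib stores header names capitalized, so the value is read back with `request.get_header("User-agent")`.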

2. Setting a proxy in urllib and requests

A good explanation: https://blog.csdn.net/shunqixing/article/details/80088028

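The main strategies for keeping a crawler from being blocked by anti-scraping measures are:

1. Set the User-Agent dynamically (switch between User-Agent strings at random to imitate the browsers of different users).

2. Use an IP address pool (VPNs and proxy IPs); most sites now ban by IP address.

3. Disable cookies (that is, do not enable the cookies middleware, so no cookies are sent to the server; some sites detect crawler behavior through cookie usage).

4. In Scrapy, the COOKIES_ENABLED setting turns CookiesMiddleware on or off.

5. Set a download delay (2 seconds or more) so that requests are not too frequent. Remember that what matters for a crawler is getting the data.

6. Google Cache and Baidu Cache: where possible, fetch the page data from the cached copies kept by search engines such as Google and Baidu.

7. Use Crawlera (a proxy component built for crawlers); once the downloader middleware is configured correctly, every request in the project is sent through Crawlera.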
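
Strategies 1 and 5 in the list above can be sketched with the standard library alone. This is an illustrative sketch, not a complete anti-blocking setup: the `USER_AGENTS` list, the helper names `build_request` and `polite_delay`, and the 2 to 4 second range are all assumptions chosen for the example.

```python
import random
import time
import urllib.request

# Illustrative pool; in practice it would be filled from the lists linked above.
USER_AGENTS = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) "
    "Gecko/2009021910 Firefox/3.0.7",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) "
    "Gecko/20100101 Firefox/115.0",
]


def build_request(url):
    """Build a request whose User-Agent is picked at random from the pool."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def polite_delay(min_seconds=2.0, max_seconds=4.0):
    """Sleep for a random interval and return how long the pause was."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

A crawl loop would call `polite_delay()` between fetches and `build_request(url)` for each page. In a Scrapy project the same ideas are normally expressed through settings such as `DOWNLOAD_DELAY` and `COOKIES_ENABLED` rather than hand-written helpers.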

urllib proxy example:

import urllib.request
import random
# proxies = [{'http': 'http://124.231.50.56:8118'}]
# proxy = random.choice(proxies)
# Set up the proxy handler
proxy = urllib.request.ProxyHandler({'http':'http://116.7.176.75:8118'})
# Build a new opener and install it over the default opener
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://www.baidu.com/s?wd=ip')
html_content = response.read().decode('utf-8')
# Look for the reported host IP in the result to check whether it changed to the proxy IP
print(html_content)
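
The commented-out lines in the example hint at choosing a proxy at random from a pool. A possible way to do that without replacing the global opener is sketched below. The helper name `build_proxy_opener` is hypothetical, and the two addresses are the ones quoted in this post, which may well no longer be reachable.

```python
import random
import urllib.request

# Proxy addresses taken from the example above; treat them as placeholders.
PROXY_POOL = [
    {"http": "http://124.231.50.56:8118"},
    {"http": "http://116.7.176.75:8118"},
]


def build_proxy_opener(pool=PROXY_POOL):
    """Pick one proxy at random and return it with an opener that uses it."""
    proxy = random.choice(pool)
    handler = urllib.request.ProxyHandler(proxy)
    # build_opener adds the default handlers itself, and the opener is NOT
    # installed globally, so plain urlopen() calls elsewhere are unaffected.
    return proxy, urllib.request.build_opener(handler)


proxy, opener = build_proxy_opener()
# opener.open("http://www.baidu.com/s?wd=ip", timeout=10) would route the
# request through the chosen proxy.
```

Keeping the opener local makes it easy to build a fresh one per request, so each request can go out through a different proxy.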

Setting a proxy with the requests module:

Regular proxy

import requests

# Choose the proxy according to the protocol of the target URL
proxies = {
  "http": "http://12.34.56.79:9527",
  "https": "http://12.34.56.79:9527",
}

response = requests.get("http://www.baidu.com", proxies=proxies)
print(response.text)

Private (authenticated) proxy

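import requests

# If the proxy requires HTTP Basic Auth, use the following format
# (replace username and password with your own credentials):
proxy = { "http": "http://username:password@61.158.163.130:16816" }

response = requests.get("http://www.baidu.com", proxies=proxy)

print(response.text)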
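
Credentials that contain characters such as `@` or `:` would break the proxy URL format shown above unless they are percent-encoded. A small sketch of one way to build the `proxies` dictionary safely follows. The helper names `build_auth_proxies` and `fetch` are hypothetical, the credentials are made up, and routing both `http` and `https` through the same proxy is an assumption that goes beyond the original example.

```python
from urllib.parse import quote


def build_auth_proxies(username, password, host, port):
    """Return a requests-style proxies dict with percent-encoded credentials."""
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{credentials}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch(url, proxies, timeout=10):
    """Fetch url through the given proxies (requires the requests package)."""
    import requests  # third-party; imported here so the helper above needs only the stdlib

    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text


# Made-up credentials; the host and port are the ones from the example above.
proxies = build_auth_proxies("user", "p@ss:word", "61.158.163.130", 16816)
```

Calling `fetch("http://www.baidu.com", proxies)` would then send the request through the authenticated proxy, with a timeout so a dead proxy does not hang the crawler.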

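3. Why set a proxy?

Running the program may raise either of the following two exceptions:

No response for a long time. urllib.error.URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host has failed to respond.>
The remote server refused the connection. ConnectionResetError: An existing connection was closed by the remote host.
Fix: switch to another proxy, or buy a paid proxy service.

Solution: write a dedicated crawler that collects the free listings from IP proxy sites, and take the proxies it scrapes…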
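
Switching to another proxy when one of these exceptions occurs can be automated. The sketch below tries each proxy in turn and returns the first successful response. The function names `fetch_via_proxy` and `fetch_with_failover` are hypothetical, and the list of proxy URLs is assumed to come from whatever source is available (a scraped free list or a paid service).

```python
import socket
import urllib.error
import urllib.request


def fetch_via_proxy(url, proxy_url, timeout=10):
    """Fetch url through one HTTP proxy and return the decoded body."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_failover(url, proxy_urls, fetch=fetch_via_proxy):
    """Try each proxy in order; return (proxy_url, body) for the first that works."""
    last_error = None
    for proxy_url in proxy_urls:
        try:
            return proxy_url, fetch(url, proxy_url)
        except (urllib.error.URLError, ConnectionResetError, socket.timeout) as exc:
            # Timed out or connection reset: remember the error, try the next proxy.
            last_error = exc
    raise RuntimeError("all proxies failed") from last_error
```

The `fetch` parameter exists so the failover logic can be exercised with a stand-in function instead of real network calls.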
