Using proxy IPs with requests in Python 3 for web scraping

Why write an article about something that barely counts as technique? Because not long ago I failed to use proxy IPs properly, wasted a fair amount of money, and only recently figured out how to use them correctly in Python 3.

First, the code I originally wrote (the target url is shown here as a placeholder):

#encoding:utf-8
import requests
import sys
import io

# Re-wrap stdout so Chinese text prints correctly on a GBK Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "http://www.xxx.com/"  # placeholder target URL (never defined in the original post)
proxie = {"http": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}

res = requests.get(url, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
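
A quick way to check whether a request really goes out through the proxy is to ask an IP-echo service which address it sees. This is a minimal sketch of my own (not from the original post), assuming httpbin.org is reachable from your network:

import requests

# Same proxy as above; httpbin echoes back the address it sees the request coming from.
proxie = {"http": "140.143.156.166:1080"}
res = requests.get("http://httpbin.org/ip", proxies=proxie, timeout=10)
print(res.json())  # should show the proxy's IP, not your own, if the proxy is in effect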

Written this way, the proxy IP is never actually used. The page still loads, but the request goes out from my own IP address, not through the proxy: requests picks a proxy entry by the scheme of the target URL, so with only an "http" key in the dict, every https:// request bypasses the proxy entirely. That makes it very easy to get your IP banned, which is why that project took me 10 days to finish crawling, wasting both time and money. After digging through a lot of material online, the code turned into this:

#encoding:utf-8
import requests
import sys
import io

# Re-wrap stdout so Chinese text prints correctly on a GBK Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "https://www.xxx.com/"  # placeholder target URL (never defined in the original post)
proxie = {"http": "140.143.156.166:1080", "https": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}

res = requests.get(url, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
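
A side note not in the original post: proxy values in the dict are conventionally written with an explicit scheme, plus credentials if the provider requires them. A minimal sketch with hypothetical credentials:

# Hypothetical values; substitute your own proxy host, port and credentials.
proxies = {
    "http":  "http://user:password@140.143.156.166:1080",
    "https": "http://user:password@140.143.156.166:1080",  # HTTPS traffic is tunneled through the same HTTP proxy
}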

This version adds an https entry to the proxies dict, but the remote connection kept being refused, which led to the final version below that uses proxies correctly:

#encoding:utf-8
import requests
import sys
import io

# Re-wrap stdout so Chinese text prints correctly on a GBK Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "https://www.xxx.com/"  # placeholder target URL (never defined in the original post)
proxie = {"http": "140.143.156.166:1080", "https": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}

# verify=False skips SSL certificate verification
res = requests.get(url, verify=False, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)

Adding verify=False, which skips SSL certificate verification, fixed it. Now the crawler really does reach the site through the proxy IP address!!!
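
One thing worth adding (not in the original post): with verify=False, urllib3 emits an InsecureRequestWarning for every request. If that clutters the output, it can be silenced like this:

import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)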

Time to set off on a happy scraping journey~~~
