Why write this barely-technical article? Not long ago I failed to use proxy IPs correctly, and it cost me a fair amount of money; only recently did I figure out how to use proxy IPs properly in Python 3.
First, here is the code I wrote back then:
#encoding:utf-8
import requests
import sys
import io

# Re-wrap stdout so Chinese pages print correctly on a GBK-family console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

# The target URL was not defined in the original snippet; assumed to be
# the https site implied by the Host header and the rest of the post.
url = "https://www.xxx.com/"

proxie = {"http": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}

res = requests.get(url, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
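A quick way to tell whether a proxy is actually in effect, something I only learned later, is to request an IP-echo service and compare the address it reports with the proxy's. A minimal sketch, using httpbin.org as one public echo endpoint (not the site from the original project):

import requests

proxie = {"http": "140.143.156.166:1080"}

# httpbin echoes back the client IP it sees; if the proxy were working,
# this would print the proxy's address rather than your own.
res = requests.get("http://httpbin.org/ip", proxies=proxie, timeout=10)
print(res.json())  # e.g. {"origin": "your.own.ip"} means the proxy was bypassed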
Written this way, the proxy IP is never used. The page still loads, but the request goes out from my own IP address, not the proxy's, which makes it very easy to get banned. The reason is that requests picks a proxy by matching the request URL's scheme against the keys of the proxies dict, so with only an "http" entry, https:// requests go straight out. That project took me 10 days to finish crawling, wasting both time and money. After digging through plenty of material online, the code became this:
#encoding:utf-8
import requests
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "https://www.xxx.com/"  # placeholder target (not defined in the original)

# Now with an "https" entry so https:// requests also route through the proxy
proxie = {"http": "140.143.156.166:1080", "https": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}

res = requests.get(url, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
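As an aside, the form the requests documentation actually recommends also includes the scheme in the proxy URL itself; the bare "host:port" form above happens to work but is ambiguous. A sketch of the same dictionary written that way (same host and port, assumed to be a plain HTTP proxy that tunnels https via CONNECT):

# Scheme-prefixed form from the requests docs; both keys point at the
# same HTTP proxy, which handles https traffic by CONNECT tunneling.
proxie = {
    "http": "http://140.143.156.166:1080",
    "https": "http://140.143.156.166:1080",
}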
This version adds the https entry to proxie, but the remote connection kept being refused. That led to the final version below, which uses proxies correctly:
#encoding:utf-8
import requests
import sys
import io

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

url = "https://www.xxx.com/"  # placeholder target (not defined in the original)

proxie = {"http": "140.143.156.166:1080", "https": "140.143.156.166:1080"}
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Host": "www.xxx.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0, no-cache",
    "Pragma": "no-cache"
}

# verify=False skips SSL certificate verification, which is what was
# causing the refused connections through this proxy
res = requests.get(url, verify=False, headers=header, proxies=proxie)
res.encoding = "utf-8"
print(res.status_code)
print(res.text)
Adding verify=False, which turns off SSL certificate verification, fixed it. From this point on the crawler really is visiting the site through the proxy IP!!!
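One side effect: with verify=False, requests (via urllib3) prints an InsecureRequestWarning on every call. If you accept the risk of skipping certificate checks, that specific warning can be silenced; a minimal sketch:

import urllib3

# verify=False disables certificate validation, so urllib3 warns on every
# request; this suppresses that particular warning globally.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Keep in mind that skipping verification does leave the connection open to man-in-the-middle tampering, so this is a trade-off rather than a free fix.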
Time to set off on a happy crawling journey~~~