A summary of problems I ran into recently while web scraping

Violetttte

Last modified 2022-06-30 11:52:29

Tags: python, html, crawler proxy mode, ip

First published 2022-06-29 22:01:16

Article link: https://blog.csdn.net/Violetttte/article/details/125529883

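I hit quite a few errors and problems while writing scrapers recently. Below I record the error messages I ran into and my thoughts on each. If this helps someone facing the same difficulties, even better!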

1.Max retries exceeded with url: /cn/vl_star.php?&mode=&s=aentc&page=9 (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))

This happens when you are connected through a VPN and your urllib3 version is 1.26 or newer.

2.(Caused by ProxyError('Cannot connect to proxy.', FileNotFoundError(2, 'No such file or directory')))

As the name suggests, a proxy is an intermediary for your traffic. This error is raised during requests.get when the proxy is configured like this:

import requests
# Upper-case keys: the configuration that produced error 2
proxies = {"HTTP": "HTTP://178.62.16.161:811", "HTTPS": "HTTP://178.62.16.161:811"}
resp = requests.get("https://www.google.com", proxies=proxies)

3.Max retries exceeded with url: /cn/vl_star.php?&mode=&s=aentc&page=9 (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(10054, '远程主机强迫关闭了一个现有的连接。', None, 10054, None)))

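This error looks very similar to the previous one, but there is one difference (the Chinese text in the message is the Windows wording for "An existing connection was forcibly closed by the remote host"). Look closely at the proxy configuration that triggers it: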

# Lower-case keys: the configuration that produced error 3
proxies = {"http": "http://178.62.16.161:811", "https": "http://178.62.16.161:811"}
resp = requests.get("https://www.google.com", proxies=proxies)

Now let me explain errors 1, 2, and 3.

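Error 1 is caused by the urllib3 version: if your urllib3 is 1.26 or newer, you get the first error. Building on that, my own experiments showed that if your code resembles the snippet under item 3, with lower-case http and https dictionary keys, you get error 3. If the keys are upper-case HTTP and HTTPS and urllib3 is 1.26 or newer, you get error 2.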
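
Why the case of the keys matters can be illustrated with a small sketch. It assumes, as far as I understand requests, that the proxies dictionary is looked up by the lower-case URL scheme (the library's own helper for this is `requests.utils.select_proxy`). The `pick_proxy` function below is a hypothetical, simplified stand-in written with the standard library only. It is not the library's actual code.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Simplified, hypothetical stand-in for a scheme-based proxy lookup.

    Keys are compared exactly (case-sensitive) against the URL scheme,
    which urlparse returns in lower case.
    """
    scheme = urlparse(url).scheme
    return proxies.get(scheme, proxies.get("all"))


upper = {"HTTP": "HTTP://178.62.16.161:811", "HTTPS": "HTTP://178.62.16.161:811"}
lower = {"http": "http://178.62.16.161:811", "https": "http://178.62.16.161:811"}

# Upper-case keys never match the lower-case scheme, so nothing is selected.
print(pick_proxy("https://www.google.com", upper))
# Lower-case keys match, so the configured proxy is selected.
print(pick_proxy("https://www.google.com", lower))
```

Under this assumption the upper-case dictionary is effectively ignored, and the request would fall back to whatever proxy the system or VPN provides. That would explain why the two dictionaries lead to different errors.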

To fix this, simply run pip install urllib3==1.25.11.
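
Before downgrading, it can help to confirm which urllib3 version is actually installed in the current environment. The sketch below is only an illustration: `version_tuple`, `is_affected`, and `installed_urllib3_version` are hypothetical helper names, and the 1.26 threshold is simply the boundary described in this article, not something taken from the urllib3 changelog.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '1.25.11' into (1, 25, 11); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_affected(version):
    """True if the version is 1.26 or newer (the range this article blames)."""
    return version_tuple(version) >= (1, 26)


def installed_urllib3_version():
    """Return the installed urllib3 version string, or None if it is absent."""
    try:
        return metadata.version("urllib3")
    except metadata.PackageNotFoundError:
        return None


current = installed_urllib3_version()
if current is None:
    print("urllib3 is not installed")
else:
    print(current, "affected" if is_affected(current) else "not affected")
```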

One addition here: some readers may hit one of the following two errors when running pip install with a package name.

Error one:

The repository located at pypi.doubanio.com is not a trusted or secure host and is being ignored

Error two:

 ERROR: Cannot unpack file C:\Users\ZHANGW~1\AppData\Local\Temp\pip-unpack-ikp51qe3\simple.htm (downloaded from C:\Users\ZHANGW~1\AppData\Local\Temp\pip-req-build-7u4k70qf, content-type: text/html); cannot detect archive format
ERROR: Cannot determine archive format of C:\Users\ZHANGW~1\AppData\Local\Temp\pip-req-build-7u4k70qf

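Typically, some readers run into the second error right after fixing the first. Both can be resolved with the following command (packagename is the name of the library you want to install):

pip install -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com packagename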
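
Note that the host passed to --trusted-host has to match the host of the index URL. As a minimal sketch, the command can be assembled in Python so that the two always stay in sync. `build_pip_command` is a hypothetical helper of my own, and the default index URL is just the mirror used above.

```python
import sys
from urllib.parse import urlparse


def build_pip_command(package, index_url="http://pypi.douban.com/simple/"):
    """Build a pip install command that trusts the host of index_url."""
    host = urlparse(index_url).hostname
    return [
        sys.executable, "-m", "pip", "install",
        "-i", index_url,
        "--trusted-host", host,
        package,
    ]


command = build_pip_command("urllib3==1.25.11")
print(" ".join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```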

Note: once urllib3 has been reinstalled at version 1.25.11, you can make requests.get calls either through a configured proxy IP or directly over the VPN.

Next are some other problems I ran into after solving the ones above.

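1. I then considered a different approach: using selenium to simulate a login. That basically solved the problem, but I noticed a small detail related to the urllib3 issue above. I had installed selenium in a virtual environment where urllib3 was already pinned to 1.25.11. After installing selenium, my earlier code started failing again. On closer inspection, selenium requires urllib3 1.26 or newer and will not work otherwise, so if you want to try both approaches, keep this possible conflict in mind.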

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
selenium 4.3.0 requires urllib3[secure,socks]~=1.26, but you have urllib3 1.25.11 which is incompatible.
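
The `~=1.26` in that message is a "compatible release" specifier (PEP 440): at least 1.26, and still within the 1.x series. The sketch below checks a version against such a specifier by hand. `release_tuple` and `satisfies_compatible_release` are hypothetical helpers that only handle plain numeric versions. They are not pip's real resolver logic.

```python
def release_tuple(version):
    """Turn '1.26' into (1, 26). Only plain numeric releases are handled."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(installed, required):
    """Check `installed` against a `~=required` specifier (PEP 440).

    `~=1.26` means: at least 1.26, and the same leading components (1.*).
    """
    have = release_tuple(installed)
    need = release_tuple(required)
    # Pad with zeros so that 1.26 and 1.26.0 compare as equal.
    width = max(len(have), len(need))
    have = have + (0,) * (width - len(have))
    need_padded = need + (0,) * (width - len(need))
    prefix = need[:-1]
    return have >= need_padded and have[:len(prefix)] == prefix


for candidate in ("1.25.11", "1.26.9", "2.0.0"):
    print(candidate, satisfies_compatible_release(candidate, "1.26"))
```

Since the requests workaround wants urllib3 1.25.11 and selenium wants `~=1.26`, one way to try both approaches is to keep them in two separate virtual environments.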

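2. While scraping a certain page, I found that the returned source code differed from what F12 (the browser developer tools) showed. After setting the cookie and user-agent request headers, I read the body with response.read().decode('utf-8') and got an error. The error message is as follows: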