Some Practical Web Scraping Tips: Notes on the Pitfalls I Hit

大牛壮壮

Categories: web scraping, Python basics. Tags: requests, web scraping, summary, tips, CAPTCHA, login

Article link: https://blog.csdn.net/zm429438709/article/details/81178035

I. Summary of the requests module

1. HTTP request methods with requests

import requests

# data must be a real dict; the key and value here are placeholders
r = requests.put('http://www.baidu.com', data={'key': 'value'})
r = requests.get('http://www.baidu.com')
r = requests.post('http://www.baidu.com')
r = requests.delete('http://www.baidu.com')
r = requests.head('http://www.baidu.com')
r = requests.options('http://www.baidu.com')

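2. Responses and encoding

Sometimes requests guesses the wrong encoding for response.text. chardet fixes this in one line; in practice you can simply copy and paste the snippet.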

import chardet
import requests
r = requests.get('http://www.baidu.com')
r.encoding = chardet.detect(r.content)['encoding']  # override the guess before reading r.text
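
One common cause of a wrong guess: when a text response has no charset in its Content-Type header, requests falls back to ISO-8859-1. Below is a minimal sketch of a fallback rule built on that observation. The names `pick_encoding` and `detect` are hypothetical and not part of requests or chardet; the detector is passed in so the rule can be tried without a network call.

```python
from typing import Callable, Optional


def pick_encoding(
    content: bytes,
    header_encoding: Optional[str],
    detect: Callable[[bytes], Optional[str]],
) -> str:
    """Trust an explicit header charset; otherwise ask the detector."""
    # ISO-8859-1 is the fallback requests uses for text/* without a charset,
    # so treat it the same as "no encoding declared".
    if header_encoding and header_encoding.lower() != 'iso-8859-1':
        return header_encoding
    guessed = detect(content)
    # Assumption: default to UTF-8 when the detector has no answer.
    return guessed or 'utf-8'


# Stand-in detector for demonstration; with chardet installed you would pass
#   lambda b: chardet.detect(b)['encoding']
print(pick_encoding(b'<html></html>', 'ISO-8859-1', lambda b: 'GB2312'))
```

With a real response this would be used as `r.encoding = pick_encoding(r.content, r.encoding, ...)` before reading `r.text`. requests also exposes `r.apparent_encoding`, which runs a detector on the body for you.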

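3. Encoding and decoding URL parameters

Chinese characters in a URL must be percent-encoded (as UTF-8 bytes by default) before they are recognized. Use urllib.parse.urlencode() for this.

Going the other way, when Fiddler or Charles captures a long parameter string, urllib.parse.parse_qs turns it into a dict that you can inspect or pass on as request parameters.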

from urllib import parse

url = "https://so.csdn.net/so/search/s.do?q=redis&t=%20&u="
param_dict = parse.parse_qs(url.split('?')[-1])
print(param_dict)
# Output: {'q': ['redis'], 't': [' ']}  (the empty u= is dropped by default)
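
The encoding direction mentioned above can be sketched the same way. The parameter values here are illustrative, not taken from a real capture.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode

# Illustrative parameters containing Chinese text and a space
params = {'q': '爬虫', 't': ' '}

# urlencode percent-encodes each value as UTF-8 and joins the pairs with '&'
query = urlencode(params)
url = 'https://so.csdn.net/so/search/s.do?' + query
print(url)

# parse_qs reverses the process; every value comes back as a list
round_trip = parse_qs(query)

# quote/unquote handle a single string rather than a dict of parameters
encoded_word = quote('爬虫')
decoded_word = unquote(encoded_word)
```

Note that urlencode turns a space into `+` while quote turns it into `%20`; parse_qs and unquote_plus understand both forms.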

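4. Encoding and decoding concepts

In Python 3, encoding a string is encode() (str to bytes) and decoding is decode() (bytes to str). A str is already Unicode, the universal character set; encoding turns it into bytes in one concrete encoding, and decoding parses those bytes back into Unicode.

UTF-8, GBK, GB2312 and so on are each just one byte encoding among many. Bytes in any of them have to be decoded to Unicode before they can be exchanged and handled uniformly, and Unicode text has to be encoded again into whatever the target expects (for example GBK) before it displays correctly there.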
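
A small round trip makes the direction of each call concrete. This is only an illustration with a sample string, using the standard str and bytes methods.

```python
text = '爬虫'  # a str: a sequence of Unicode code points

# encode: str -> bytes, in a specific encoding
utf8_bytes = text.encode('utf-8')  # 3 bytes per character here
gbk_bytes = text.encode('gbk')     # 2 bytes per character here

# decode: bytes -> str, using the same encoding the bytes were written in
from_utf8 = utf8_bytes.decode('utf-8')

# Decoding with the wrong encoding gives garbled text (or raises an error)
garbled = utf8_bytes.decode('gbk', errors='replace')

# Converting GBK bytes to UTF-8 bytes always goes through Unicode
transcoded = gbk_bytes.decode('gbk').encode('utf-8')

print(len(utf8_bytes), len(gbk_bytes), from_utf8 == text, garbled == text)
```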

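5. Handling CAPTCHA refresh when simulating login

On some sites the CAPTCHA refreshes when clicked. Such sites probably set a cookie on every refresh to track the current CAPTCHA, so logging in with an old CAPTCHA does not work. The approach is:

a. Find the image endpoint and send a GET request to fetch the CAPTCHA.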

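import requests

img_url = 'https://so.gushiwen.org/RandCode.ashx'
headers = {'user-agent': 'Mozilla/5.0'}  # placeholder; the full code below sets a real browser UA
s = requests.session()
img_res = s.get(url=img_url, headers=headers)
content = img_res.content  # raw bytes of the image
with open('img.jpg', 'wb') as fp:  # write in binary mode
    fp.write(content)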

b. Take the session or cookies from the CAPTCHA response and send them along with the POST login request.

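post_cookies = img_res.cookies  # cookies set when the CAPTCHA was downloaded; this step is the key

img_code = TestFunc()  # TestFunc is the API function of the Feifei CAPTCHA-solving platform (fateadm_api)

data = {
    'email': 'your email',
    'pwd': 'your password',
    'denglu': '登陆',
    'code': img_code
}

post_url = 'https://so.gushiwen.org/user/login.aspx'
# Log in with the cookies from the CAPTCHA request. Without them the login fails,
# and cookies fetched only now would no longer match the CAPTCHA that was solved.
login_res = requests.post(url=post_url, cookies=post_cookies, headers=headers, data=data)
print(login_res.text)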

The full code is as follows:

import ssl
import time
from fateadm_api import TestFunc
import requests
import urllib.request
import urllib.parse

proxy_url = 'http://kps.kdlapi.com/api/getkps/?orderid=903231446033175&num=100&pt=1&sep=1'

proxy_res = requests.get(proxy_url)

print(proxy_res.text)

proxy = {
    # requests matches proxies on the lowercase URL scheme; pass this dict as proxies=proxy to use it
    'https': proxy_res.text,
}

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:61.0) Gecko/20100101 Firefox/61.0',
}

# Log in first to obtain the cookie

TestFunc()

# req = urllib.request.Request(headers=headers, url=get_url)
#
# cookie_jar = http.cookiejar.CookieJar()
#
# handler = urllib.request.HTTPCookieProcessor(cookie_jar)
#
# opener = urllib.request.build_opener()
#
# get_res = opener.open(req)
#
# content = get_res.read().decode('utf-8')
#
# print(cookie_jar)


ssl._create_default_https_context = ssl._create_unverified_context

img_url = 'https://so.gushiwen.org/RandCode.ashx'
s = requests.session()
img_res = s.get(url=img_url, headers=headers)
content = img_res.content
with open('img.jpg', 'wb') as fp:
    fp.write(content)

post_cookies = img_res.cookies

img_code = TestFunc()
#
data = {
    'email': '',
    'pwd': '',
    'denglu': '登陆',
    'code': img_code
}

post_url = 'https://so.gushiwen.org/user/login.aspx'
login_res = requests.post(url=post_url, cookies=post_cookies, headers=headers, data=data)

print(login_res.text)
# print(img_res.cookies)


# post_res = s.post()
# print(s.cookies)
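
The two steps of section 5 can also be sketched with a single session object, since a requests.Session keeps the cookies from the CAPTCHA request and sends them with the login request automatically. This is a sketch under assumptions, not the code above: `login_with_captcha` and `solve_captcha` are hypothetical names, the session is passed in (for example `requests.Session()`), and whether this particular site accepts the flow has not been verified here.

```python
from typing import Callable, Optional

IMG_URL = 'https://so.gushiwen.org/RandCode.ashx'
LOGIN_URL = 'https://so.gushiwen.org/user/login.aspx'


def login_with_captcha(
    session,                                # e.g. requests.Session()
    solve_captcha: Callable[[bytes], str],  # hypothetical: image bytes -> code
    email: str,
    password: str,
    headers: Optional[dict] = None,
):
    """Fetch the CAPTCHA and log in through the same session."""
    # Step a: download the CAPTCHA; the session stores any cookies it sets
    img_res = session.get(IMG_URL, headers=headers)

    # Recognize the code from the image bytes (manually or via a solving service)
    code = solve_captcha(img_res.content)

    # Step b: post the login form; the session re-sends the stored cookies
    data = {
        'email': email,
        'pwd': password,
        'denglu': '登陆',
        'code': code,
    }
    return session.post(LOGIN_URL, headers=headers, data=data)
```

Called as `login_with_captcha(requests.Session(), my_solver, email, password, headers)`, this removes the need to pass `cookies=` by hand, which is the step that is easy to forget.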

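6. Sometimes the scraper raises an ssl... error, which roughly means that SSL certificate verification failed. You can work around it by disabling certificate verification (copy the code below) or by closing Charles. Note that this patch affects urllib; with requests, pass verify=False to the call instead.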

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
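
If patching the global default feels too broad, a narrower variant is to build an unverified context and pass it only to the calls that need it. This is a sketch using the public ssl and urllib.request APIs; the helper names are hypothetical, and disabling verification should stay limited to debugging.

```python
import ssl
import urllib.request


def make_unverified_context() -> ssl.SSLContext:
    """Build an SSL context that skips certificate and hostname checks."""
    ctx = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_without_verification(url: str) -> bytes:
    """Fetch a URL with verification disabled for this one call only."""
    with urllib.request.urlopen(url, context=make_unverified_context()) as res:
        return res.read()
```

Other HTTPS requests made by the same program keep normal verification, because the global default context is left untouched.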
