Article Title

Latest recommended article published on 2022-10-30 12:21:08

诺亚废船

Category: Python crawlers

Article link: https://blog.csdn.net/m0_37752335/article/details/77988777


The Python Requests library

Using Selenium + PhantomJS

The Requests library for Python crawlers

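This section mainly covers requests.get() and requests.post().
A GET request retrieves page information from the server.
A POST request mainly submits form data to the server, such as a login password. The response to a POST carries some information back, such as a new page or related data.
Request parameters include headers, the request headers (including the headers that identify the browser), and cookies: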
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, compress',
    'Accept-Language': 'en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
}

cookies = {'testCookies_1': 'Hello_Python3', 'testCookies_2': 'Hello_Requests'}
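
As a minimal sketch of how such headers and cookies end up in a request, the snippet below builds a request without sending it and inspects the result. The URL http://example.com/ is a placeholder chosen for illustration, not something from the article, and the headers dict is shortened.

```python
import requests

headers = {
    "Accept-Language": "en-us;q=0.5,en;q=0.3",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0",
}
cookies = {"testCookies_1": "Hello_Python3", "testCookies_2": "Hello_Requests"}

# http://example.com/ is a placeholder URL (an assumption, not from the article).
request = requests.Request("GET", "http://example.com/", headers=headers, cookies=cookies)
prepared = request.prepare()  # builds the request locally; nothing is sent

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])  # the cookies dict is folded into one Cookie header
```

To actually send the request, the same arguments can be passed directly, for example requests.get(url, headers=headers, cookies=cookies).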

proxies
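To avoid having your IP banned while scraping, you will often route requests through a proxy. requests supports this through the proxies parameter.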

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://www.zhidaow.com", proxies=proxies)
If the proxy requires a username and password, write it like this:

proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
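
When the proxy credentials contain characters such as '@' or ':', they need to be percent-encoded before being placed in the proxy URL. The sketch below shows one way to do that and to attach the proxies to a session so every request reuses them. build_proxies is a hypothetical helper and the password is made up; only the host and port come from the example above.

```python
from urllib.parse import quote

import requests


def build_proxies(host, port, user=None, password=None):
    """Return a proxies dict for requests (hypothetical helper, not part of requests)."""
    auth = ""
    if user is not None and password is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"http://{auth}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# The password below is an invented example value.
proxies = build_proxies("10.10.1.10", 3128, user="user", password="p@ss:word")

session = requests.Session()
session.proxies.update(proxies)  # requests made through this session use the proxy
print(session.proxies["http"])
```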

requests.get('http://www.dict.baidu.com/s', params={'wd': 'python'})  # example of GET parameters

requests.post('http://www.itwhy.org/wp-comments-post.php', data={'comment': 'test POST'})  # POST parameters
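
To see what params and data actually do, the following sketch prepares both requests without sending them: params is encoded into the URL query string, while data becomes a form-encoded request body. The comment value 'test POST' is used here purely as sample input.

```python
import requests

# Build (but do not send) a GET and a POST request to inspect how they are encoded.
get_req = requests.Request(
    "GET", "http://www.dict.baidu.com/s", params={"wd": "python"}
).prepare()
post_req = requests.Request(
    "POST", "http://www.itwhy.org/wp-comments-post.php", data={"comment": "test POST"}
).prepare()

print(get_req.url)                       # params appear in the query string
print(post_req.body)                     # data is form-encoded into the body
print(post_req.headers["Content-Type"])
```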

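Handling JSON data
With urllib and urllib2, working with JSON means importing another module such as json or simplejson. requests already has a built-in method for this, r.json(). Take an IP lookup API as an example:

r = requests.get('http://ip.taobao.com/service/getIpInfo.php?ip=122.88.60.28')
r.json()['data']['country']
'中国'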
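
Indexing straight into r.json() fails if the server returns an error page or a payload with a different shape. A more defensive version might look like the sketch below. extract_country and fake_response are hypothetical helpers, and the sample payloads are invented so the example runs offline; only the data and country keys come from the example above.

```python
import io

import requests


def extract_country(response):
    """Return data.country from a JSON response, or None if it is unavailable."""
    if response.status_code != 200:
        return None
    try:
        payload = response.json()
    except ValueError:  # raised when the body is not valid JSON
        return None
    if not isinstance(payload, dict):
        return None
    data = payload.get("data")
    return data.get("country") if isinstance(data, dict) else None


def fake_response(body, status_code=200):
    """Build an offline Response for demonstration; not a real server reply."""
    response = requests.Response()
    response.status_code = status_code
    response.raw = io.BytesIO(body)
    return response


ok = fake_response(b'{"data": {"country": "CN"}}')
broken = fake_response(b"<html>service unavailable</html>")
print(extract_country(ok))
print(extract_country(broken))
```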

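r.status_code  # HTTP status code of the response
r.raw  # the raw response, i.e. the underlying urllib3 response object; pass stream=True to the request, then read it with r.raw.read()
r.content  # response body as bytes; gzip and deflate compression is decoded automatically
r.text  # response body as a string, decoded using the character encoding given in the response headers
r.headers  # server response headers stored in a dict-like object; keys are case-insensitive, and r.headers.get() returns None for a missing key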
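
The differences between these attributes can be checked offline. The sketch below fills in a Response object by hand as a stand-in for a real server reply (an assumption made so the example needs no network access) and then reads it through the attributes listed above.

```python
import io

import requests
from requests.structures import CaseInsensitiveDict

# Offline stand-in for a server reply (an assumption for demonstration only).
r = requests.Response()
r.status_code = 200
r.headers = CaseInsensitiveDict({"Content-Type": "text/html; charset=utf-8"})
r.encoding = "utf-8"  # on a real response this is derived from the Content-Type header
r.raw = io.BytesIO("<p>café</p>".encode("utf-8"))

print(type(r.content).__name__)    # bytes
print(type(r.text).__name__)       # str
print(r.headers["content-type"])   # lookup ignores the case of the key
print(r.headers.get("X-Missing"))  # .get() returns None for a missing key
```

Note that r.headers["X-Missing"] with square brackets raises KeyError; only .get() returns None.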

Special methods

r.json()  # the JSON decoder built into Requests
r.raise_for_status()  # raises an exception for a failed request (4xx or 5xx response)
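
As a small illustration of which status codes r.raise_for_status() treats as failures, the sketch below sets the status code on an empty Response object by hand. status_is_error is a hypothetical helper, and no request is sent.

```python
import requests


def status_is_error(status_code):
    """Return True if raise_for_status() raises for this status code."""
    response = requests.Response()  # offline stand-in; no request is sent
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        return True
    return False


for code in (200, 204, 304, 404, 500):
    print(code, status_is_error(code))
```

In a crawler, a common pattern is to call r.raise_for_status() right after the request and catch requests.exceptions.HTTPError around it.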

Simulating a login with requests

import requests
from bs4 import BeautifulSoup

url = "http://www.v2ex.com/signin"
UA = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.13 Safari/537.36"

header = {
    "User-Agent": UA,
    "Referer": "http://www.v2ex.com/signin",
}

# A Session keeps cookies between requests, so the login state carries over.
v2ex_session = requests.Session()
f = v2ex_session.get(url, headers=header)

# The sign-in form contains a hidden "once" token that must be posted back.
soup = BeautifulSoup(f.content, "html.parser")
once = soup.find('input', {'name': 'once'})['value']
print(once)

postData = {
    'u': 'whatbeg',
    'p': '*****',
    'once': once,
    'next': '/',
}

# Submit the login form through the same session.
v2ex_session.post(url, data=postData, headers=header)

# Fetch a page that requires login to check that it worked.
f = v2ex_session.get('http://www.v2ex.com/settings', headers=header)
print(f.content.decode())
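
In the login code above, soup.find(...) returns None when the page has no once field, and indexing it with ['value'] then raises a TypeError. A slightly more defensive way to extract the token might look like this sketch. extract_once is a hypothetical helper, and the sample markup is invented; the real sign-in page may be structured differently.

```python
from bs4 import BeautifulSoup


def extract_once(html):
    """Return the value of the hidden 'once' input, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", {"name": "once"})
    if field is None:
        return None
    return field.get("value")


# Made-up sample markup; the real sign-in page may be structured differently.
sample_form = '<form><input type="hidden" name="once" value="12345"></form>'
print(extract_once(sample_form))
print(extract_once("<form></form>"))
```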
