Learning Web Crawlers from Scratch: The Requests Library

lxmanutd

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/lxmanutd/article/details/53443228

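This article introduces the basic usage of Python's Requests library, including GET/POST requests, setting headers and the user-agent, and managing cookies and sessions, to help beginners quickly pick up web-crawling skills.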

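Preface

Starting today we formally enter the field of web crawlers and learn the tools used to build them. This section introduces the basic usage of the requests library. requests (http://www.python-requests.org) is a third-party Python library that is good at handling complex HTTP requests, cookies, headers (both request and response headers), and related details.

Note: the Python version is still 2.7.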

Official documentation

For more details, see the official documentation (http://www.python-requests.org).

Installation

Install with pip:

pip install requests
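
To confirm that the installation worked, one quick check is to import the library and print its version. A minimal sketch (the variable name installed_version is only illustrative):

```python
import requests

# If the import succeeds, requests is installed; __version__ is its version string.
installed_version = requests.__version__
print(installed_version)
```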

Common usage:

Using the get/post methods in requests, especially writing the request headers (headers and user-agent), and using cookies and sessions.

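1. The most basic use of get

import requests

url = "http://sports.sina.com.cn/global/"
r = requests.get(url)
print(r.status_code)   # HTTP status code of the response
print(r.encoding)      # encoding requests uses to decode r.text
# print(r.text)        # decoded page body
print(r.cookies)       # cookies set by the server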
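
In a crawler, a request can fail for many reasons, so it is common to wrap get in a small helper that sets a timeout and checks the status code. The sketch below is one possible way to do that; fetch_page and its timeout default are hypothetical names chosen for this example, not part of requests. It is written with print() calls so that it also runs on Python 3.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the decoded page body, or None if the request fails.

    fetch_page and its timeout default are illustrative, not part of requests.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
    except requests.exceptions.RequestException as exc:
        print("request failed: %s" % exc)
        return None
    return r.text


# A URL without a scheme is rejected before any network traffic is sent.
result = fetch_page("sports.sina.com.cn/global/")
print(result)
```

Catching requests.exceptions.RequestException covers connection errors, timeouts, invalid URLs, and the HTTPError raised by raise_for_status, so the caller only has to check for None.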

2. Using post

import requests

params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi", data=params)
print(r.text)
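
To see what the data argument actually puts on the wire, the same request can be built without sending it. This sketch uses requests.Request and its prepare() method to inspect the form-encoded body; it only illustrates the encoding step and does not contact the server.

```python
import requests

params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
req = requests.Request(
    'POST',
    "http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi",
    data=params,
)
prepared = req.prepare()  # builds the request but does not send it

print(prepared.method)
print(prepared.headers['Content-Type'])
print(prepared.body)
```

A dict passed as data is sent as a URL-encoded form, which is why the "@" in the address appears as %40 in the body.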

3. Using headers, especially user-agent

import requests

# url is the address to request, for example the one defined in section 1
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
           'Referer': 'http://www.mafengwo.cn/u/5354853/note.html'}
source = requests.get(url, headers=headers)
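
The reason for setting User-Agent is that requests identifies itself as python-requests by default, which many sites treat as a bot. The sketch below compares the default value with a custom one set once on a Session, and inspects the final headers without sending anything. The variable names are illustrative, and the URL is simply reused from section 1.

```python
import requests

url = "http://sports.sina.com.cn/global/"
mobile_ua = ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/46.0.2490.76 Mobile Safari/537.36')

session = requests.Session()
default_ua = session.headers['User-Agent']   # what requests sends by default
session.headers.update({'User-Agent': mobile_ua})

# Build the request without sending it, to inspect the final headers.
prepared = session.prepare_request(requests.Request('GET', url))
print(default_ua)
print(prepared.headers['User-Agent'])
```

Headers set on the session apply to every request made through it, so they do not need to be passed to each get or post call.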

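4. Using cookies and sessions in requests, and how they compare

(1) Logging in with a saved cookie

The saved cookie file is expected to hold name=value pairs separated by semicolons, which is the format the script below parses: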

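# -*- coding: UTF-8 -*-
import requests

# Convert the saved cookies into a dict. zhihu_cookie is the saved cookie
# file, located in the same directory as this script.
def get_cookie():
    with open('zhihu_cookie', 'r') as f:
        cookies = {}
        for line in f.read().split(';'):
            if '=' not in line:
                continue  # skip empty or malformed pieces
            name, value = line.strip().split('=', 1)  # 1 means split only once
            cookies[name] = value
        return cookies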

s = requests.Session()
url = 'http://www.zhihu.com/#signin'
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Accept':'*/*',
    'Accept-Language':'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding':'gzip, deflate, br',
    'Referer':'https://www.zhihu.com/'
    }
req2 = s.get(url, headers = headers, cookies = get_cookie(), verify=False)
html = req2.content

# Write the fetched page source to zhihu.html (content is bytes, so open in binary mode)
with open('zhihu.html', 'wb') as fl:
    fl.write(html)
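
The cookie-parsing step can be tried on its own, without a file or a network connection. This is a small sketch of the same splitting logic applied to a string; parse_cookie_string and the sample cookie text are made up for illustration.

```python
def parse_cookie_string(raw):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for piece in raw.split(';'):
        piece = piece.strip()
        if '=' not in piece:
            continue  # skip empty pieces, e.g. after a trailing ';'
        name, value = piece.split('=', 1)  # values may contain '='
        cookies[name] = value
    return cookies


# Made-up cookie text, only for illustration.
sample = "sessionid=abc123; theme=dark; token=a=b;\n"
parsed = parse_cookie_string(sample)
print(parsed)
```

Splitting on the first "=" only matters because cookie values themselves often contain "=" characters.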

(2) Log in first, obtain the cookies, then use them for later requests

import requests

params = {'username': 'Ryan', 'password': 'password'}

r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print(r.cookies.get_dict())
# Send the cookies from the login response along with the next request
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)
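
With the plain requests.get function, cookies are only sent if they are passed explicitly. The sketch below shows this offline by preparing two requests and comparing their Cookie headers; login_cookies is a stand-in for r.cookies from a real login response, and its values are made up.

```python
import requests
from requests.cookies import cookiejar_from_dict

# Stand-in for r.cookies from a login response; the values are made up.
login_cookies = cookiejar_from_dict({'loggedin': '1', 'username': 'Ryan'})
print(login_cookies.get_dict())

profile_url = "http://pythonscraping.com/pages/cookies/profile.php"
with_cookies = requests.Request('GET', profile_url, cookies=login_cookies).prepare()
without_cookies = requests.Request('GET', profile_url).prepare()

cookie_header = with_cookies.headers.get('Cookie')
print(cookie_header)
print(without_cookies.headers.get('Cookie'))
```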

Alternatively, log in first and save the cookies to a file, so that later runs can log in with the saved cookies:

import requests
import cookielib  # Python 2 module; on Python 3 it is http.cookiejar

# NOTE: isLogin(), post_url and postdata are not shown in the original post;
# they are assumed to be defined elsewhere.

Agent = 'Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0'
headers = {
    # "Host": "http://www.zhihu.com",
    "Referer": "https://www.zhihu.com/",
    'User-Agent': Agent
}

session = requests.Session()
session.cookies = cookielib.LWPCookieJar(filename='cookies')   # create the cookie jar
try:
    session.cookies.load(ignore_discard=True)                  # try to load saved cookies
except IOError:
    print("Failed to load cookies")

if isLogin():
    print('Already logged in')          # already logged in, nothing more to do
else:
    # the cookies are expired or invalid, so log in again
    login_page = session.post(post_url, data=postdata, headers=headers)
    login_code = login_page.text
    print(login_page.status_code)
    print(login_code)
    session.cookies.save(ignore_discard=True)   # save cookies, including session cookies
html = session.get("https://www.zhihu.com/topic/19552832", headers=headers)  # now the page can be fetched
print(html.text)
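
The save and load steps can be tested without logging in anywhere. The sketch below writes a made-up session cookie to a temporary file and reads it back, and also shows why the ignore_discard flag matters: cookies without an expiry date are marked as discardable and are skipped by save() unless the flag is set. The cookie name, value, and file location are assumptions for this example.

```python
import os
import tempfile

try:
    from http.cookiejar import LWPCookieJar   # Python 3
except ImportError:
    from cookielib import LWPCookieJar        # Python 2

from requests.cookies import create_cookie

cookie_file = os.path.join(tempfile.mkdtemp(), 'cookies')

# A made-up session cookie (no expiry date), as a login would typically set.
jar = LWPCookieJar(filename=cookie_file)
jar.set_cookie(create_cookie('sessionid', 'abc123', domain='www.zhihu.com'))

# Without ignore_discard=True, session cookies are skipped when saving.
jar.save()
plain = LWPCookieJar(filename=cookie_file)
plain.load()
saved_by_default = [c.name for c in plain]

# With the flag set on both save and load, the cookie survives the round trip.
jar.save(ignore_discard=True)
restored = LWPCookieJar(filename=cookie_file)
restored.load(ignore_discard=True)
saved_with_flag = {c.name: c.value for c in restored}

print(saved_by_default)
print(saved_with_flag)
```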

(3) Log in with a session; the session then keeps the login state

import requests

params = {'username': 'Ryan', 'password': 'password'}
session = requests.Session()
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print(s.cookies.get_dict())
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")  # the session sends the cookies automatically
print(s.text)
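
The difference between a session and a standalone request can be observed offline. In this sketch a cookie is placed in the session's jar by hand, as a stand-in for what a successful login would store, and two requests to the same URL are prepared without being sent.

```python
import requests

profile_url = "http://pythonscraping.com/pages/cookies/profile.php"

session = requests.Session()
# Stand-in for the cookie a successful login would store; the value is made up.
session.cookies.set('loggedin', '1')

# Built through the session: the stored cookie is attached automatically.
via_session = session.prepare_request(requests.Request('GET', profile_url))
# Built on its own: nothing is attached unless cookies= is passed explicitly.
standalone = requests.Request('GET', profile_url).prepare()

print(via_session.headers.get('Cookie'))
print(standalone.headers.get('Cookie'))
```

This is the practical reason to prefer a Session for anything that involves logging in: the cookie bookkeeping from approach (2) happens automatically.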

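As the examples above show, requests is a fairly powerful tool: it covers the functionality of urllib and urllib2 while being simple and easy to use.

You should be fluent in the following requests features:

(1) Writing headers and using user-agent

(2) Using get and post

(3) Managing cookies and sessions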