[Python] Web Scraping in Practice with the Requests Library

Latest recommended article published on 2021-03-09 14:33:37

Lzy_First

Tags: python

Article link: https://blog.csdn.net/weixin_44100826/article/details/105566845

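Today I tried out web scraping with Python's Requests library and found it fun, so here are my notes.

Required library: requests

I used the kesci platform; when creating a project there, select the packages shown in the screenshot.
[Screenshot: package selection when creating the kesci project]

1. Basics

A brief introduction to web scraping with the Requests library.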

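Attributes of the Response object

Attribute	Description
r.status_code	HTTP status code of the response; 200 means success, 404 means the page was not found
r.text	Response body as a string, i.e. the content of the page at the URL
r.encoding	Encoding of the response body, guessed from the HTTP headers
r.apparent_encoding	Encoding of the response body, detected from the content itself (fallback)
r.content	Response body as raw bytes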
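To see how these attributes relate without touching the network, here is a minimal sketch that builds a `requests.Response` by hand. Assigning `status_code`, `raw` and `encoding` manually is only an illustration trick (a real response comes from `requests.get`), and the sample body is made up.

```python
import io
import requests

# Build a Response by hand (illustration only; normally requests.get returns one).
r = requests.Response()
r.status_code = 200
r.raw = io.BytesIO("中文".encode("utf-8"))  # made-up body, stored as UTF-8 bytes

raw_bytes = r.content        # binary form of the body

r.encoding = "ISO-8859-1"    # a wrong guess for this body
garbled = r.text             # decoded with the wrong encoding, unreadable

r.encoding = "utf-8"         # the correct encoding
fixed = r.text               # decoded again, now readable

print(r.status_code, raw_bytes, repr(garbled), fixed)
```

Requests falls back to ISO-8859-1 for text responses whose headers name no charset, which is why the code later in this post assigns `r.apparent_encoding` to `r.encoding` before reading `r.text`.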

Exception handling in Requests

Method	Description
r.raise_for_status()	Raises requests.HTTPError if the status code indicates an error (4xx or 5xx)
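The behavior can be checked offline with a small sketch. The `check` helper is hypothetical, and the response is again built by hand rather than fetched, purely so the example runs without a network connection.

```python
import requests

def check(status_code):
    """Return 'ok', or the error message raised for the given status code."""
    r = requests.Response()      # hand-built response, illustration only
    r.status_code = status_code
    try:
        r.raise_for_status()     # does nothing for success codes
        return "ok"
    except requests.HTTPError as e:
        return f"HTTPError: {e}"

for code in (200, 404, 500):
    print(code, check(code))
```

Calling `raise_for_status()` right after the request lets a single `try`/`except` cover both connection failures and error status codes.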

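2. Basic page fetching

# Basic page fetch

import requests

# Fetch a page and report whether the request succeeded
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # raise requests.HTTPError on 4xx/5xx
        r.encoding = r.apparent_encoding  # use the encoding detected from the content
        return "Fetch succeeded!"
    except requests.RequestException:
        return "Fetch failed!"

url = "https://xhsysu.edu.cn/"
print(getHTMLText(url))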

3. Fetching with a search keyword

The params argument takes key-value pairs and submits them as the URL's query string.
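To see what `params` does to the URL without sending anything, a request can be prepared and inspected. This is a minimal sketch; `build_search_url` is a hypothetical helper name, while `requests.Request` and its `prepare()` method are part of the library.

```python
import requests

def build_search_url(url, keyword):
    # Prepare the request without sending it, to inspect the final URL.
    req = requests.Request("GET", url, params={"wd": keyword})
    return req.prepare().url

print(build_search_url("https://www.baidu.com/s", "Python"))
```

The key-value pair ends up after a `?` in the URL, which is the same string that `r.request.url` reports after a real request.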

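# Fetch with a search keyword
import requests

def getHtmlSearch(url, keyword):
    try:
        kv = {'wd': keyword}  # 'wd' is Baidu's search keyword parameter
        r = requests.get(url, params=kv, timeout=10)
        print(r.request.url)  # the final URL, including the query string
        r.raise_for_status()
        return "Fetch succeeded"
    except requests.RequestException:
        return "Fetch failed"

url = "https://www.baidu.com/s"
keyword = "Python"
print(getHtmlSearch(url, keyword))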

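4. Downloading an image

url.split("/")[-1]: splits the URL on "/" as the separator and takes the string after the last "/", which is the file name.
On the kesci platform the code cannot write to your local file system: it prints "File saved", but no image appears in the folder. Running the script locally, for example in PyCharm, solves this.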
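The file name extraction can be tried on its own, with no download involved. In this small sketch the `pics` directory name is an assumption chosen for illustration; `os.path.join` is used so the path separator suits the operating system.

```python
import os

url = "http://www.nizhongyi.cn/resumeStatic/picture/zyCommunity.jpg"

parts = url.split("/")   # every piece of the URL between slashes
filename = parts[-1]     # the piece after the last slash

root = "pics"            # hypothetical target directory
path = os.path.join(root, filename)

print(parts)
print(filename)
print(path)
```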

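# Image download
import requests
import os

url = "http://www.nizhongyi.cn/resumeStatic/picture/zyCommunity.jpg"
root = "D:/pics/"
path = root + url.split("/")[-1]  # D:/pics/zyCommunity.jpg
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:  # the with block closes the file
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Download failed")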
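For larger files, one possible variation is to stream the body in chunks instead of loading it all through `r.content`. This is only a sketch and not the method used above: `save_chunks` and `download_image` are hypothetical helper names, and the 8192-byte chunk size is an arbitrary choice. The `stream=True` argument and `iter_content` are part of Requests.

```python
import requests

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total

def download_image(url, path, chunk_size=8192):
    """Stream url to path (hypothetical helper, not from the original post)."""
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()
        return save_chunks(r.iter_content(chunk_size=chunk_size), path)
```

Calling `download_image(url, path)` with the `url` and `path` from the script above would write the same file while keeping only one chunk in memory at a time.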