Basic Usage of Python Web Scraping

Latest recommended article published on 2024-09-14 17:35:17

慵c

Category: Web Scraping. Tags: python, web scraping

Article link: https://blog.csdn.net/qq_38700625/article/details/121136985

Preface: this post uses Python's requests library to fetch content from websites.

1 The requests library

Python has many libraries for web scraping. A brief overview:

Request libraries	Parsing libraries	Storage libraries	Frameworks
urllib	beautifulsoup	pymysql	Scrapy
requests	pyquery	pymongo	Crawley
selenium	lxml	redisdump	Portia
aiohttp	tesserocr		newspaper
			python-goose
			cola

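The requests library is built on top of urllib3. Compared with using the standard urllib module directly, it is simpler, more convenient, and friendlier to beginners.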

2 Usage

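1) Install the requests library: pip install requests
2) Determine the URL to request, the query parameters params (given as a dictionary), and headers (the request headers, also a dictionary).
3) Call requests.get() to send the request and get a response object. Setting response.encoding chooses the encoding used to decode the body, response.text returns the body as a string, and response.content returns it as raw bytes.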
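The steps above can be sketched as follows. This is a minimal illustration, not code from the original post: the helper names build_request and fetch_text, the example.com URL, and the parameter values are all assumptions made for the example. build_request only prepares the request, so the final URL can be inspected without any network access; fetch_text actually sends it.

```python
import requests


def build_request(url, params=None, headers=None):
    """Prepare a GET request without sending it, so the final URL can be inspected."""
    return requests.Request("GET", url, params=params, headers=headers).prepare()


def fetch_text(url, params=None, headers=None, encoding="utf-8", timeout=10):
    """Send a GET request and return (decoded text, raw bytes)."""
    response = requests.get(url, params=params, headers=headers, timeout=timeout)
    response.raise_for_status()    # raise on 4xx/5xx instead of parsing an error page
    response.encoding = encoding   # controls how response.text decodes the body
    return response.text, response.content


# Hypothetical values, for illustration only.
example_url = "https://example.com/search"
example_params = {"q": "python", "page": 1}
example_headers = {"User-Agent": "Mozilla/5.0"}

# The params dictionary is URL-encoded and appended as the query string.
prepared = build_request(example_url, example_params, example_headers)
print(prepared.url)
```

Calling fetch_text(example_url, example_params, example_headers) would perform the real request. The timeout argument is optional, but without it requests may wait indefinitely for a server that never answers.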

3 Application example

Scraping poet data from Gushiwen (gushiwen.cn):

import json
import os

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Cookie': 'wxopenid=defoaltid; Hm_lvt_9007fab6814e892d3020a64454da5a55=1635823892,1635837412,1635837707,1635838370; Hm_lpvt_9007fab6814e892d3020a64454da5a55=1635838946',
}

# Assumption: the original code never defines `path`; a local directory is used here.
path = './author_images/'
os.makedirs(path, exist_ok=True)

# Walk through pages 1 to 20 of the author list
for i in range(1, 21):
    url = f"https://app.gushiwen.cn/api/author/Default10.aspx?c=%E4%B8%8D%E9%99%90&page={i}&token=gswapi"
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    # The response is JSON, so parse it with json.loads
    authors = json.loads(res.text)['authors']
    for item in authors:
        print(item)
        # Request the author's image with requests and save it to disk
        img_url = 'https://song.gushiwen.cn/authorImg/' + item['pic'] + '.jpg'
        r = requests.get(img_url)
        # The with block closes the file automatically, so no f.close() is needed
        with open(path + item['pic'] + '.jpg', 'wb') as f:
            f.write(r.content)
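One possible refinement of the example above is to separate the JSON parsing and path building from the network calls, so that this logic can be checked without hitting the site. The sketch below is only an illustration under stated assumptions: the helper names parse_authors and image_job, the output directory author_images, and the sample payload are all hypothetical. Only the 'authors' and 'pic' keys and the image base URL come from the original code.

```python
import json
import os

IMAGE_BASE_URL = "https://song.gushiwen.cn/authorImg/"  # taken from the example above


def parse_authors(body_text):
    """Return the list stored under the 'authors' key, or [] if the key is missing."""
    return json.loads(body_text).get("authors", [])


def image_job(author, out_dir):
    """Return (image_url, local_path) for one author record, or None if it has no 'pic'."""
    pic = author.get("pic")
    if not pic:
        return None
    filename = pic + ".jpg"
    return IMAGE_BASE_URL + filename, os.path.join(out_dir, filename)


# Hypothetical payload: the values are made up, only the key names match the article.
sample = '{"authors": [{"pic": "libai"}, {"pic": ""}]}'

jobs = [image_job(author, "author_images") for author in parse_authors(sample)]
jobs = [job for job in jobs if job is not None]  # skip records without an image
print(jobs)
```

Each (image_url, local_path) pair can then be passed to requests.get() and written to disk exactly as in the loop above. Skipping records with an empty 'pic' avoids requesting a URL that ends in just ".jpg".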