Introduction to Web Crawlers and the Request Modules urllib and requests

3S水乡

Article link: https://blog.csdn.net/weixin_46002243/article/details/112059137

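1. Introduction to crawlers

What is a crawler?

In one sentence: a crawler is a program that stands in for a person and imitates a browser to carry out operations on web pages.

Why do we need crawlers?

To provide a data source. For example, a search engine first crawls websites for information and then assembles it into a results page that is shown to the user.
To collect data for data analysis.
Artificial intelligence applications (smart homes, autonomous driving, voice assistants, smart navigation, face recognition, ...)

How do companies obtain data?

Data the company owns itself
Data purchased from third-party platforms (Baidu Index, Datatang, etc.)
Data collected by crawlers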

2. A Python crawler roughly breaks down into three steps: fetching the data, extracting the data, and analyzing the data.
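As a rough illustration of these three steps, here is a minimal sketch. It assumes a hard-coded HTML snippet in place of a real download, and the names SAMPLE_HTML, LinkTextParser, extract_titles, and analyze are made up for this example rather than taken from any library.

```python
from collections import Counter
from html.parser import HTMLParser

# Stand-in for step 1 (fetching): a hard-coded page instead of a network call.
SAMPLE_HTML = """
<ul>
  <li><a href="/p/1">python crawler basics</a></li>
  <li><a href="/p/2">python requests tips</a></li>
  <li><a href="/p/3">urllib quick notes</a></li>
</ul>
"""


class LinkTextParser(HTMLParser):
    """Step 2 (extraction): collect the text inside every <a> tag."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles.append(data.strip())


def extract_titles(html):
    parser = LinkTextParser()
    parser.feed(html)
    return parser.titles


def analyze(titles):
    """Step 3 (analysis): count how often each word appears in the titles."""
    return Counter(word for title in titles for word in title.split())


titles = extract_titles(SAMPLE_HTML)
word_counts = analyze(titles)
print(titles)
print(word_counts.most_common(1))  # [('python', 2)]
```

In a real crawler the first step would be a network request, and the extraction step is often done with a dedicated parsing library instead of the standard-library parser used here.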

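3. A few concepts about network requests

GET: the query parameters are shown in the URL.
POST: the query parameters are not shown in the URL.
URL: Uniform Resource Locator, the address of the resource being requested.
User-Agent: identifies the client, recording the user's operating system, browser, and so on, so that the server can return an HTML page suited to that client.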

User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

Referer: indicates which URL the current request came from. Servers commonly check it as an anti-crawling measure.
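To make the GET/POST difference concrete, the following sketch builds both kinds of request with the standard library's urllib (covered in the next section) and inspects them without sending anything. The endpoint https://example.com/search and its parameters are placeholders chosen for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and parameters, used only for illustration.
base_url = "https://example.com/search"
params = {"kw": "python", "pn": 50}
query = urllib.parse.urlencode(params)  # "kw=python&pn=50"

# GET: the encoded parameters are appended to the URL itself.
get_req = urllib.request.Request(base_url + "?" + query)

# POST: the same parameters travel in the request body as bytes.
post_req = urllib.request.Request(base_url, data=query.encode("ascii"))

# Nothing is sent here; the Request objects are only inspected.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

urllib picks the method from the arguments: a Request with no data is a GET, and one with data becomes a POST.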

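4. urllib

urllib is a built-in module (part of the standard library), whereas requests is a third-party module.
Python 2: urllib2 and urllib
Python 3: urllib and urllib2 were merged into a single urllib package.
There is also the urllib3 module, which I have not used much.
urllib.request.urlopen() cannot set custom headers when it is given a plain URL. If a request needs custom headers, build a request object with urllib.request.Request() and pass that object to urlopen().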

req = urllib.request.Request(url, headers=headers)
res = urllib.request.urlopen(req)
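The sketch below shows why the custom header matters. It prints the User-Agent that urllib would send by default and then the headers stored on a Request object. The URL and the Referer value are placeholders, and nothing is actually requested.

```python
import urllib.request

url = "https://example.com/"  # placeholder URL, nothing is actually requested

# Header urllib adds when no User-Agent is supplied, e.g. "Python-urllib/3.x".
default_opener = urllib.request.build_opener()
print(default_opener.addheaders)

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Referer": "https://example.com/start",  # hypothetical referring page
}
req = urllib.request.Request(url, headers=headers)

# urllib stores header names capitalized as "User-agent", "Referer", ...
print(req.get_header("User-agent"))
print(req.header_items())
```

A default value such as Python-urllib/3.x is easy for a server to recognize as a script, which is why the examples in this article send a browser-style User-Agent instead.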

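If the URL of a GET request contains Chinese characters, they are percent-encoded: each UTF-8 byte is written as % followed by two hexadecimal digits.
urllib.parse provides two functions for this:

urllib.parse.urlencode(dictionary)
urllib.parse.quote(string)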

urllib.parse.urlencode({"kw": "海贼王"})

Output: kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B

urllib.parse.quote("海贼王")

Output: %E6%B5%B7%E8%B4%BC%E7%8E%8B
URL of page 2 of the 海贼王 (One Piece) Tieba forum:

https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50

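Note: in UTF-8 each of these Chinese characters takes three bytes, so one character is written as three %XX groups. For example: 海 = %E6%B5%B7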
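The following short sketch puts these pieces together: quote() for a single string, urlencode() for a whole query string, and unquote() to reverse the encoding. The extra ie and pn parameters are taken from the page 2 URL shown above.

```python
import urllib.parse

keyword = "海贼王"

# quote() percent-encodes a single string.
quoted = urllib.parse.quote(keyword)
print(quoted)  # %E6%B5%B7%E8%B4%BC%E7%8E%8B

# urlencode() turns a dict into a full "key=value&key=value" query string.
query = urllib.parse.urlencode({"kw": keyword, "ie": "utf-8", "pn": 50})
page_url = "https://tieba.baidu.com/f?" + query
print(page_url)

# Each character is three UTF-8 bytes, hence three %XX groups per character.
first_char_bytes = keyword[0].encode("utf-8")
print(first_char_bytes)  # b'\xe6\xb5\xb7'

# unquote() reverses the encoding.
print(urllib.parse.unquote(quoted))  # 海贼王
```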

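5. requests

requests.get() accepts custom headers directly:
res = requests.get(url, headers=headers)
res.content is the raw response body, type(res.content) = <class 'bytes'>; res.text is the decoded body, type(res.text) = <class 'str'>
For a POST request, the query parameters do not appear in the URL:

res = requests.post(url, data=data, headers=headers)

data must contain the query parameters, for example as a dict.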
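One way to see where the parameters end up is to prepare a request without sending it. This is a minimal sketch assuming the third-party requests package is installed. The endpoint and form fields are placeholders.

```python
import requests

# Hypothetical endpoint and form fields, used only for illustration.
url = "https://example.com/search"
data = {"kw": "python", "pn": "50"}
headers = {"User-Agent": "Mozilla/5.0"}

# prepare() builds the request as it would be sent, without sending it.
get_req = requests.Request("GET", url, params=data, headers=headers).prepare()
post_req = requests.Request("POST", url, data=data, headers=headers).prepare()

print(get_req.url)    # the parameters are part of the URL
print(post_req.url)   # the URL carries no parameters
print(post_req.body)  # the parameters are form-encoded in the body
print(post_req.headers["Content-Type"])
```

For GET requests the dict is passed as params, for POST requests as data. In both cases requests does the percent-encoding itself.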

6. Downloading an image from the web and saving it locally

urllib implementation

import urllib.request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
url = "https://ss1.bdstatic.com/70cFvXSh_Q1YnxGkpoWK1HF6hhy/it/u=4001356234,2763706243&fm=26&gp=0.jpg"

req = urllib.request.Request(url, headers=headers)
res = urllib.request.urlopen(req)
with open("tupian.jpg", 'wb') as pic:
    pic.write(res.read())

requests implementation

import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}
url = "https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=3323914398,641435642&fm=26&gp=0.jpg"

res = requests.get(url, headers=headers)
with open("dahua.jpg", "wb") as pic:
    pic.write(res.content)
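Both versions above read the whole image into memory before writing it. For larger files, a possible variation is to stream the response to disk in chunks. The sketch below uses only the standard library. The function name download, the shortened User-Agent, and the 10-second timeout are choices made for this example.

```python
import shutil
import urllib.request

HEADERS = {"User-Agent": "Mozilla/5.0"}


def download(url, path, timeout=10):
    """Stream the response body at `url` into the file at `path`."""
    req = urllib.request.Request(url, headers=HEADERS)
    # Both the response and the output file are closed when the block exits.
    with urllib.request.urlopen(req, timeout=timeout) as res, open(path, "wb") as out:
        shutil.copyfileobj(res, out)  # copies in chunks, not all at once
    return path
```

It would be called as download(url, "tupian.jpg") with one of the image URLs above. The timeout keeps the script from hanging indefinitely if the server stops responding.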

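7. Crawling Baidu Tieba pages and saving them locally

Basic idea: build the URL of each page of the forum, then save every page's data locally.

1. Enter the name of the forum to search for
2. Enter the first page to crawl
3. Enter the last page to crawl
4. Crawl the data and save it locally

urllib implementation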

import urllib.parse
import urllib.request

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

kw_input = input("pls input keyword: ")
start_pn = int(input("Enter the first page: "))
end_pn = int(input("Enter the last page: "))

# Percent-encode the keyword so it can be placed in the URL
kw_input = urllib.parse.quote(kw_input)
# print(kw_input, type(kw_input))

# Page 2: https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=50
# Page 3: https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&ie=utf-8&pn=100

base_url = "https://tieba.baidu.com/f?kw={kw_input}&pn={page}"

for page in range(start_pn, end_pn+1):

    url = base_url.format(kw_input=kw_input, page=(page-1)*50)
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    with open(f"page_{page}.html", 'w', encoding='utf-8', newline='') as f:
        content = res.read().decode('utf-8')
        f.write(content)
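The page-to-URL mapping used in the loop above can be isolated in a small helper, which makes it easy to check before any request is sent. This is a sketch: the names build_page_url, build_page_urls, and POSTS_PER_PAGE are made up here, and the step of 50 is inferred from the page 2 and page 3 URLs shown in the comments.

```python
import urllib.parse

BASE_URL = "https://tieba.baidu.com/f"
POSTS_PER_PAGE = 50  # step of the pn parameter seen in the page URLs


def build_page_url(keyword, page):
    """Return the URL of the given 1-based page of a forum."""
    query = urllib.parse.urlencode(
        {"kw": keyword, "ie": "utf-8", "pn": (page - 1) * POSTS_PER_PAGE}
    )
    return f"{BASE_URL}?{query}"


def build_page_urls(keyword, first_page, last_page):
    """Return the URLs for first_page through last_page, inclusive."""
    return [build_page_url(keyword, p) for p in range(first_page, last_page + 1)]


for url in build_page_urls("海贼王", 1, 3):
    print(url)
```

Because urlencode() does the percent-encoding, the keyword can be passed in as typed, without calling quote() first.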

requests implementation

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"}

kw_input = input("pls input keyword: ")
start_pn = int(input("Enter the first page: "))
end_pn = int(input("Enter the last page: "))

base_url = "https://tieba.baidu.com/f?kw={kw_input}&pn={page}"

for page in range(start_pn, end_pn+1):

    url = base_url.format(kw_input=kw_input, page=(page-1)*50)
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    print(res.status_code)
    with open(f"page_{page}.html", 'w', encoding='utf-8', newline='') as f:
        f.write(res.text)
