Python Web Scraping (Part 1)

且听风吟zyw

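The three steps of a Python web scraper:

Fetch the page: send a request to a URL, and the server returns the full content of that page.
Parse the page: extract the data you want from the returned page.
Store the data: save the data you extracted.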

Install the requests library:

Open a cmd window
Run pip install requests

Fetch a page with the requests library:

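import requests

link = 'https://blog.csdn.net/even160941'
# Send a browser-like User-Agent so the request looks like it comes from Chrome
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
r = requests.get(link, headers=headers)
print(r.text)  # r.text holds the page's HTML source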

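The code above fetches the source of my blog page. Two things to note:

The User-Agent header makes the request look like it comes from a browser
r.text holds the page's source code
After running the code, you will see the page's full HTML source printed to the console.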
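
As a rough sketch, the same request can be made a little more defensive. The helper name fetch_html and the 10-second timeout are assumptions for illustration, not part of the original code. The helper adds a timeout and raises an error on HTTP failure statuses instead of quietly returning an error page.

```python
import requests

# Same browser-like User-Agent string as in the example above
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36'
}


def fetch_html(url, timeout=10):
    """Return the HTML source of `url` (hypothetical helper, not from the post)."""
    # timeout stops the call from hanging forever if the server never answers
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    r.raise_for_status()
    return r.text
```

Calling fetch_html('https://blog.csdn.net/even160941') should then give the same text as r.text above, or raise an exception if the request fails.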

Extract the data:

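Next, install the bs4 library:

Open a cmd window
Run pip install bs4
The code below uses the 'lxml' parser, so also run pip install lxml if it is not already installed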
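
A quick way to confirm the installs worked is to ask Python whether the packages can be found. This is only a sketch using the standard library, and the helper name missing_packages is made up for illustration. It checks requests and bs4, plus lxml because the parsing code in this post passes 'lxml' to BeautifulSoup.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means all three packages are importable
print(missing_packages(['requests', 'bs4', 'lxml']))
```

Any name that shows up in the printed list still needs a pip install.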

The code is as follows:

import requests
from bs4 import BeautifulSoup

link = 'https://blog.csdn.net/even160941'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
r = requests.get(link, headers=headers)

# Parse the HTML into a BeautifulSoup object using the lxml parser
soup = BeautifulSoup(r.text, 'lxml')
# Find the first <span class="date"> and take its text with whitespace stripped
date = soup.find('span', class_='date').text.strip()
print(date)

Here the BeautifulSoup library parses the page: first import the library, then parse the page source into a BeautifulSoup object, and finally use soup.find('span',class_='date').text.strip() to locate the date we need.
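
To see how soup.find behaves without touching the network, here is a small sketch on a hard-coded HTML snippet. The snippet, the date value, and the helper name extract_date are made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without extra installs. One thing worth knowing: soup.find returns None when nothing matches, so calling .text on the result directly raises an AttributeError on a page that has no such span.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
SAMPLE_HTML = """
<div class="article">
  <h4>Example post</h4>
  <span class="date"> 2019-02-13 10:00:00 </span>
</div>
"""


def extract_date(html):
    """Return the text of the first <span class="date">, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('span', class_='date')
    if tag is None:
        return None
    return tag.text.strip()


print(extract_date(SAMPLE_HTML))       # 2019-02-13 10:00:00
print(extract_date('<p>no date</p>'))  # None
```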

Store the data:

import requests
from bs4 import BeautifulSoup

link = 'https://blog.csdn.net/even160941'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
date = soup.find('span', class_='date').text.strip()

# 'a+' appends to date.txt, creating the file if it does not exist
with open('date.txt', 'a+') as t:
    t.write(date)
# No explicit t.close() is needed: the with block closes the file on exit

Open date.txt and you will see that the date has been written to it.
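
Here is a sketch of the same storage step with two small changes, both my own assumptions rather than the post's method: an explicit encoding='utf-8' so non-ASCII text is written the same way on every system, and a trailing newline so repeated runs do not glue the dates together on one line. The helper name append_line and the sample values are hypothetical.

```python
import os
import tempfile


def append_line(path, text):
    """Append `text` plus a newline to the file at `path`."""
    # 'a' creates the file if needed and always writes at the end
    with open(path, 'a', encoding='utf-8') as f:
        f.write(text + '\n')


# Demo in a temporary directory so no real date.txt is touched
demo_path = os.path.join(tempfile.mkdtemp(), 'date.txt')
append_line(demo_path, '2019-02-13 10:00:00')
append_line(demo_path, '2019-02-14 09:30:00')

with open(demo_path, encoding='utf-8') as f:
    print(f.read().splitlines())
```

Each call adds one line, so the file ends up with one date per run.
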
Note: this post is based on https://blog.csdn.net/weixin_42183408/article/details/87203499
