Python Web Scraping in Practice (6): Basic BeautifulSoup Operations


Problem Solver


Article link: https://blog.csdn.net/weixin_42715262/article/details/112360431


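This is part six of the Python Web Scraping in Practice series. Earlier parts:

Python Web Scraping in Practice (1): The Data Challenge in the Big Data Era

Python Web Scraping in Practice (2): Unstructured Data and Web Crawlers

Python Web Scraping in Practice (3): The Secrets Behind Web Crawlers

Python Web Scraping in Practice (4): Writing Your First Web Crawler

Python Web Scraping in Practice (5): Parsing Page Elements with BeautifulSoup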

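In the previous part, we used BeautifulSoup to strip the HTML tags from a page and keep only its text. Sometimes that is not enough. Suppose we want the images on a page: the information we need is exactly what lives in the HTML that was stripped out. So what do we do? This part covers the basic operations of BeautifulSoup. Once you know them, you can flexibly extract whatever information you want.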


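First, let's revisit the code from the previous part.

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page and decode it as UTF-8
    res = requests.get('https://tech.sina.com.cn/')
    res.encoding = 'utf-8'

    # Parse the HTML and print only the text, with all tags stripped
    soup = BeautifulSoup(res.text, 'html.parser')
    print(soup.text)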

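Next, we can use the soup object's select method to pull out the elements that match a specific tag. We'll again demonstrate with a real example.

[Figure: CSDN post]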

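Inspecting this CSDN page's source with the browser's developer tools shows that the title is stored in an h1 tag, so we can write the following code to extract the title directly.

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://blog.csdn.net/qq_36119192/article/details/82079150')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')

    # select() returns a list of every h1 tag on the page
    header = soup.select('h1')
    print(header)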

Extraction result:

[Figure: extraction result]
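To see what select('h1') returns without depending on the live page, here is a minimal offline sketch. The HTML snippet and its class names (other than title-article) are hypothetical stand-ins, not the actual CSDN markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page that contains several h1 tags.
SAMPLE_HTML = """
<html><body>
  <h1 class="site-name">Example Blog</h1>
  <h1 class="title-article">Basic BeautifulSoup Operations</h1>
  <h1 class="sidebar-title">Related Posts</h1>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector and returns a list of every matching Tag.
headers = soup.select("h1")

for tag in headers:
    # tag.get("class") gives the class attribute as a list of class names,
    # or None when the tag has no class attribute.
    print(tag.get("class"), tag.get_text(strip=True))
```

Because select() always returns a list, a page with several h1 tags yields several entries, and each entry still carries its attributes.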

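The result differs somewhat from what we wanted. The modified code does extract the h1 tags on their own, but the page contains several h1 tags, so the result is not ideal. Looking at the output, we can see that the extracted h1 tags have different class attributes. Can we use that to tell them apart? Yes. Let's modify the code again.

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://blog.csdn.net/qq_36119192/article/details/82079150')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')

    # A leading dot selects by class: only elements with class "title-article"
    header = soup.select('.title-article')
    print(header)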

[Figure: extraction result]
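The class-based selection can also be tried offline. The sketch below assumes a hypothetical HTML snippet in which only one h1 carries the title-article class. It also shows select_one and get_text, two BeautifulSoup methods the original code does not use, as one way to get the title as a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page with several h1 tags and different classes.
SAMPLE_HTML = """
<html><body>
  <h1 class="site-name">Example Blog</h1>
  <h1 class="title-article">Basic BeautifulSoup Operations</h1>
  <h1 class="sidebar-title">Related Posts</h1>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# ".title-article" matches any element with that class.
by_class = soup.select(".title-article")

# "h1.title-article" is stricter: the element must also be an h1.
by_tag_and_class = soup.select("h1.title-article")

# select_one() returns the first match, or None if nothing matches.
title_tag = soup.select_one("h1.title-article")
title_text = title_tag.get_text(strip=True) if title_tag is not None else None
print(title_text)
```

Combining the tag name with the class is a little more robust if another element type on the page happens to reuse the same class name.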

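Now back to the question from the beginning of the article: how do we extract the image links? We can write the code like this.

    import requests
    from bs4 import BeautifulSoup

    res = requests.get('https://blog.csdn.net/qq_36119192/article/details/82079150')
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')

    images = soup.select('img')
    for image in images:
        # .get() avoids a KeyError for img tags that have no src attribute
        src = image.get('src')
        if src:
            print(src)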

Running the code prints the image URLs. With these URLs, we can use Python to save the images to our own computer.
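The article does not show the saving step, so the following is only a sketch under stated assumptions. The helper names collect_image_urls, filename_for, and save_images are hypothetical, as is the images output directory. It also assumes that relative src values should be resolved against the page URL with urljoin.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def collect_image_urls(html, page_url):
    """Return absolute URLs for every img tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for image in soup.select("img"):
        src = image.get("src")
        if src:
            # Resolve relative paths such as "/static/a.png" against the page URL.
            urls.append(urljoin(page_url, src))
    return urls


def filename_for(url, index):
    """Use the last path segment as the file name, or a numbered fallback."""
    name = PurePosixPath(urlparse(url).path).name
    return name or f"image_{index}"


def save_images(urls, dest_dir="images"):
    """Download each URL and write the raw bytes into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for index, url in enumerate(urls):
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        (dest / filename_for(url, index)).write_bytes(response.content)


# Example usage (performs network requests, so it is left commented out):
# page_url = 'https://blog.csdn.net/qq_36119192/article/details/82079150'
# res = requests.get(page_url, timeout=10)
# res.encoding = 'utf-8'
# save_images(collect_image_urls(res.text, page_url))
```

Images are written with response.content (bytes) rather than response.text, since image data is binary. Two images with the same file name would overwrite each other in this sketch.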