Scraping page titles and URLs with Python

菜鸟驿站2020

Article link: https://blog.csdn.net/weixin_38946164/article/details/107003370

First, the result:
[Screenshot: the final output]

The full code:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.geekdigging.com/')
html = req.text

bf = BeautifulSoup(html, 'lxml')  # parse the page with the lxml parser
texts = bf.find_all('h2', class_='item-title')  # collect every <h2 class="item-title"> tag

a_bf = BeautifulSoup(str(texts), 'lxml')  # re-parse the matched tags with the lxml parser
a = a_bf.find_all('a')  # from those results, collect every <a> tag

list_a = []
for aa in a:
    # Remove the spaces from the title, read the href, and append both as a tuple.
    # get_text() replaces .string, which is None when the <a> contains nested tags.
    list_a.append((aa.get_text().replace(' ', ''), aa.get('href')))

list_a.sort()  # sort the list in ascending order
for aaa in list_a:
    print(aaa[0], aaa[1])

The screenshots below show what each part of the code does.

[Screenshot]

Code:

bf = BeautifulSoup(html, 'lxml')  # parse the page with the lxml parser
texts = bf.find_all('h2', class_='item-title')  # collect every <h2 class="item-title"> tag
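
To see what this filter keeps without fetching the live page, here is a minimal sketch run against a small inline HTML string. The markup in `SAMPLE_HTML` is made up for illustration and is only assumed to resemble the real page; `html.parser` (the standard-library parser) is swapped in for `lxml` so the snippet has one fewer dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption, not the site's HTML)
SAMPLE_HTML = """
<div>
  <h2 class="item-title"><a href="/post/b"> Second post </a></h2>
  <h2 class="other"><a href="/skip">Not matched</a></h2>
  <h2 class="item-title"><a href="/post/a"> First post </a></h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python
headings = soup.find_all("h2", class_="item-title")

for heading in headings:
    print(heading)
```

Only the two headings that carry the `item-title` class are returned; the `<h2 class="other">` element is skipped.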

------------------------------------------------------------------------------------------------------------------------------------
[Screenshot]
Code:

a_bf = BeautifulSoup(str(texts), 'lxml')  # re-parse the matched tags with the lxml parser
a = a_bf.find_all('a')  # from those results, collect every <a> tag
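
Converting the result list to a string and parsing it a second time works, but the same links can probably be reached in one pass. The sketch below is an alternative, not the approach used in this post: it uses Beautiful Soup's `select()` with a CSS selector. `SAMPLE_HTML` is again made-up markup, and `html.parser` stands in for `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption)
SAMPLE_HTML = """
<div>
  <h2 class="item-title"><a href="/post/b"> Second post </a></h2>
  <h2 class="other"><a href="/skip">Not matched</a></h2>
  <h2 class="item-title"><a href="/post/a"> First post </a></h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "h2.item-title a" matches every <a> nested inside an <h2 class="item-title">
links = soup.select("h2.item-title a")

hrefs = [link.get("href") for link in links]
print(hrefs)
```

The links come back in document order, and the soup is built only once.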

------------------------------------------------------------------------------------------------------------------------------------

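[Screenshot: output with extra spaces in the titles]
As the screenshot above shows, the titles contain a lot of spaces, which is not the format I want, so I remove every space.

list_a = []
for aa in a:
    # Remove the spaces from the title, read the href, and append both as a tuple.
    # get_text() replaces .string, which is None when the <a> contains nested tags.
    list_a.append((aa.get_text().replace(' ', ''), aa.get('href')))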
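
Note that `replace(' ', '')` deletes every space character, including the ones between words, and leaves newlines and tabs alone. That suits titles with no meaningful spaces; for titles that contain spaces between words, the sketch below compares it with two gentler options. The sample link is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical link whose text has padding, a newline, and repeated inner spaces
SAMPLE_HTML = '<h2 class="item-title"><a href="/post/a">\n   Hello   Python  </a></h2>'
link = BeautifulSoup(SAMPLE_HTML, "html.parser").find("a")

raw = link.get_text()

# Approach from this post: drops every space, keeps the newline
no_spaces = raw.replace(" ", "")

# Strips leading and trailing whitespace only; inner spaces are untouched
trimmed = link.get_text(strip=True)

# Collapses every run of whitespace (spaces, tabs, newlines) into one space
collapsed = " ".join(raw.split())

print(repr(no_spaces), repr(trimmed), repr(collapsed))
```

Which one fits depends on the titles: `collapsed` is usually the safest choice when word boundaries matter.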

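[Screenshot: output after removing the spaces]
At this point the format is what I wanted, but I am still not happy with the ordering. Forgive me, I'm a Virgo; let's keep polishing.

list_a.sort()  # sort the list in ascending order
for aaa in list_a:
    print(aaa[0], aaa[1])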
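
Because each entry is a `(title, url)` tuple, `sort()` compares titles first and falls back to the URL only when two titles are equal. Strings are compared by Unicode code point, so uppercase letters sort before lowercase ones and Chinese titles are not ordered by pinyin. A small sketch with made-up rows shows the default order next to two `key`-based alternatives:

```python
# Hypothetical (title, url) pairs, in the same shape as list_a
rows = [
    ("Python basics", "/post/3"),
    ("asyncio intro", "/post/1"),
    ("Zen of Python", "/post/2"),
]

# Default tuple ordering: by title (code-point order), then by URL
default_order = sorted(rows)

# Case-insensitive ordering by title
by_title = sorted(rows, key=lambda row: row[0].casefold())

# Ordering by URL instead of title
by_url = sorted(rows, key=lambda row: row[1])

for title, url in by_title:
    print(title, url)
```

`sorted()` returns a new list and leaves `rows` untouched, whereas `list.sort()` reorders the list in place.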

[Screenshot: the final sorted output]
The final result is much easier on the eyes.
