Using BeautifulSoup in Python

panda_225400

Categories: Python, Other, Ubuntu. Tags: python, programming languages, backend

Article link: https://blog.csdn.net/panda_225400/article/details/121088116


Contents

Preface
I. What Is BeautifulSoup?
II. How to Use It
- 1. Installing the libraries
- 2. Parsing
Summary

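Preface

Before I came to Python, I parsed web pages with regular expressions. If a pattern is even slightly off, the program can end up stuck, seemingly looping forever. Python, however, has a tool called BeautifulSoup, which makes it easy to extract the content of HTML or XML tags.

Note: the main body of the article follows; the examples below are for reference.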

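I. What Is BeautifulSoup?

BeautifulSoup provides a set of simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application can be written with very little code.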
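As a rough illustration of the three kinds of operations mentioned above (navigating, searching, and modifying the parse tree), here is a minimal sketch. The HTML snippet, the variable names, and the use of Python's built-in "html.parser" (so that no extra parser is required) are assumptions made for this example and do not come from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
  <ul id="menu">
    <li><a href="/home">Home</a></li>
    <li><a href="/news">News</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute-style access walks down to the first matching tag.
page_title = soup.title.string
first_link_text = soup.ul.li.a.string

# Searching: find_all() collects every matching tag in document order.
hrefs = [a["href"] for a in soup.find_all("a")]

# Modifying: tags can be changed in place and the tree serialized again.
soup.ul.li.a["href"] = "/index"
soup.ul.li.a.string = "Index"
modified_first_link = str(soup.ul.li.a)

print(page_title, first_link_text, hrefs, modified_first_link)
```

Working on an inline string like this is a convenient way to try out calls before pointing them at a live site.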

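II. How to Use It

1. Installing the libraries

The commands are as follows (example):

pip install beautifulsoup4
pip install lxml
pip install requests

2. Parsing

The code is as follows (example):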

# coding:utf-8
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)'}
r = requests.get("http://www.samr.gov.cn/zw/", headers=headers)
soup = BeautifulSoup(r.content, "lxml")
# print(soup)
# print(soup.title)
# print(soup.li)           # the first <li> element in the document
# print(soup.li.string)    # None if the <li> has more than one child node
# print(soup.li.contents)  # list of the tag's direct children
# print(soup.li.children)  # iterator over the same direct children
# for i in soup.li.children:
#     print(i.string)

# items = soup.select("body > div .saictopbox > div .mainShareDiv_24 > div > div > a")
# items = soup.select("body > div .saictopbox > div.share.topshare > div.mainShareDiv_24 > div")
# for i in items:
#     print(i)

# items = soup.select_one("body > div .saictopbox > div .mainShareDiv_24 > div > div > a")
# print(items)  # select_one returns only the first match (or None)

# items = soup.find_all("a", attrs={"href": "#"})  # every match, as a list
items = soup.find("a", attrs={"href": "#"})  # first match only, or None
print(items)
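The commented-out calls above are easier to compare on a small, fixed document than on a live page. The following is a sketch only: the HTML snippet and variable names are made up for illustration, and "html.parser" is used instead of lxml so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
html = """
<ul>
  <li>plain text</li>
  <li><b>bold</b> and more</li>
</ul>
<a href="#">first</a>
<a href="#">second</a>
<a href="/other">third</a>
"""

soup = BeautifulSoup(html, "html.parser")
first_li, second_li = soup.find_all("li")

# .string gives the text when the tag has a single child
# (a text node, or a tag that itself has a .string); otherwise None.
single_child = first_li.string       # the lone text node
several_children = second_li.string  # None: a <b> tag plus a text node

# .contents is a list; .children iterates over the same direct children.
child_count = len(second_li.contents)
same_children = list(second_li.children) == second_li.contents

# find() returns the first match (or None); find_all() returns every match.
first_hash = soup.find("a", attrs={"href": "#"})
all_hash = soup.find_all("a", attrs={"href": "#"})
missing = soup.find("a", attrs={"href": "/nowhere"})

# select_one() and select() are the CSS-selector counterparts.
first_by_css = soup.select_one('a[href="#"]')
all_by_css = soup.select('a[href="#"]')

print(single_child, several_children, child_count)
print(first_hash.get_text(), len(all_hash), missing)
```

Because find() and select_one() can return None, it is worth checking the result before calling methods on it when scraping real pages.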

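When an element has several classes, select() wants them chained without spaces, which puzzled me at first. The reason is CSS selector syntax: div.share.topshare matches a single div that carries both classes, whereas a space (div .share) means "an element with class share somewhere inside a div".
The code is as follows (example):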

items = soup.select("body > div .saictopbox > div.share.topshare > div.mainShareDiv_24 > div")
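To see how the multi-class form differs from the form with spaces, here is a small sketch on a made-up document. The HTML is hypothetical and only borrows the class names from the selector above; it is not the real markup of that site, and "html.parser" is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: one div carries two classes, "share" and "topshare".
html = """
<div class="saictopbox">
  <div class="share topshare">
    <div class="mainShareDiv_24"><div>target</div></div>
  </div>
  <div class="share"><div class="mainShareDiv_24"><div>other</div></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# No space: ONE element that is a div and has both classes.
both_classes = soup.select("div.share.topshare")

# With a space: an element with class "topshare" nested INSIDE a div.share.
nested = soup.select("div.share .topshare")

# Chained with ">" (direct child), in the style of the selector above.
inner = soup.select("div.share.topshare > div.mainShareDiv_24 > div")

print(len(both_classes), len(nested), [d.get_text() for d in inner])
```

In this snippet the spaced version matches nothing, because no element with class topshare sits inside another div.share; that difference is why the classes have to be written back to back.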

Writing it as follows also works.
The code is as follows (example):

# coding:utf-8
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)'}
url = "https://www.qq.com/"
resp = requests.get(url, headers=headers)
# print(resp.encoding)
wbdata = resp.text
soup = BeautifulSoup(wbdata, "lxml")
# print(soup)
# news_titles = soup.select("div .detail > h3 > a[target='_blank']")
news_titles = soup.select("div.bd > ul.news-list > li > a[target='_blank']")
for i in news_titles:
    title = i.get_text()  # text content of the tag
    link = i.get("href")  # value of the tag's href attribute
    # print each result as a dict
    data = {"title": title, "link": link}
    print(data)
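The same extraction pattern can be tried without a network request. The sketch below assumes a made-up HTML fragment shaped like the selector above (it is not the real qq.com markup) and uses "html.parser"; it also shows what get_text() and get() return for untidy input.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a news list; not the real qq.com markup.
html = """
<div class="bd"><ul class="news-list">
  <li><a target="_blank" href="https://example.com/a">  Story A </a></li>
  <li><a target="_blank">Story without link</a></li>
  <li><a href="https://example.com/c">Opens in same tab</a></li>
</ul></div>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for a in soup.select("div.bd > ul.news-list > li > a[target='_blank']"):
    records.append({
        "title": a.get_text(strip=True),  # strip=True trims surrounding whitespace
        "link": a.get("href"),            # .get() returns None if href is absent
    })

print(records)
```

The third link is skipped because it has no target="_blank" attribute, and the second yields a link of None rather than raising an error, which is the reason for using get("href") instead of indexing with ["href"].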

Summary

Just recording what I learn, bit by bit.
