The Powerful BeautifulSoup (1)

Latest recommended article published on 2020-06-15 11:00:41

weixin_43837855

Article link: https://blog.csdn.net/weixin_43837855/article/details/106686115

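Introduction to BeautifulSoup
BeautifulSoup is a Python library for extracting data directly from HTML and XML files. It supports the HTML parser in the Python standard library (html.parser) as well as several third-party parsers. Most of the time we use lxml, which generally parses much faster than the standard-library HTML parser.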

Installing BeautifulSoup

In cmd (the Windows command prompt), run: pip install beautifulsoup4
To install the third-party parser lxml, run: pip install lxml
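
A quick way to confirm that the installation worked is to import the package and check which parser is available. The following is a minimal sketch, not an official check; the helper name pick_parser is hypothetical and not part of the bs4 API. It falls back to the built-in html.parser when lxml is not installed.

```python
import importlib.util

import bs4
from bs4 import BeautifulSoup


def pick_parser():
    """Return 'lxml' if it is installed, else the built-in 'html.parser'.

    Hypothetical helper for illustration; not part of the bs4 API.
    """
    if importlib.util.find_spec('lxml') is not None:
        return 'lxml'
    return 'html.parser'


parser = pick_parser()
print('beautifulsoup4 version:', bs4.__version__)
print('parser in use:', parser)

# Parse a tiny document to confirm everything works end to end
soup = BeautifulSoup('<p>hello</p>', parser)
print(soup.p.text)
```

If the import fails with ModuleNotFoundError, the package was probably installed into a different Python environment than the one running the script.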

Using BeautifulSoup
The following code fetches the Baidu News page, saves it as an HTML file, and parses that file.

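import os
import time

import requests
from bs4 import BeautifulSoup

# Identify as a browser; some sites reject the default requests User-Agent
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64)'
headers = {'User-Agent': user_agent}
time.sleep(2)  # short pause before the request, to be polite to the server
response = requests.get('http://news.baidu.com/', headers=headers)
txt = response.text
# Save the page locally so it can be parsed again without another request
os.makedirs('html', exist_ok=True)
path = os.path.join('html', 'news.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(txt)
# Parse the saved file with the lxml parser
with open(path, encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
print(soup.prettify())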

Once you have a BeautifulSoup object, you normally extract content from the HTML through the basic elements of the BeautifulSoup class.
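
To make those basic elements concrete, here is a small sketch on a made-up snippet (the markup is an assumption for illustration, not the Baidu page). It shows a Tag, its name and attributes, and the two kinds of text node, NavigableString and Comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Small made-up document, used only for illustration
html = '<p class="lead" id="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.p
print(type(p) is Tag)    # a Tag object represents one element
print(p.name)            # the tag's name
print(p.attrs)           # the tag's attributes as a dict
print(p['id'])           # a single attribute value

text, note = p.contents  # the children: a text node and a comment
print(isinstance(text, NavigableString), str(text))
print(isinstance(note, Comment), str(note))
```

Note that class is a multi-valued attribute, so p.attrs stores it as a list even when there is only one class.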
Extracting information from the HTML

The examples below work on a fragment of this HTML file.

# Print the <title> element
print(soup.title)
# Print the first <meta> tag
print(soup.meta)
# Print the tag's name
print(soup.meta.name)
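
Because the results above depend on the live page, a self-contained version may be easier to follow. The markup below is a made-up stand-in, not the real Baidu News source. The sketch shows that attribute-style access such as soup.meta returns only the first matching tag, and None when no such tag exists.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page; not the real Baidu News markup
html = """
<html><head>
<title>Example News</title>
<meta charset="utf-8">
<meta name="description" content="demo page">
</head><body></body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)            # the whole <title> element
print(soup.title.string)     # only the text inside it
print(soup.meta)             # the first <meta> only, not both
print(soup.meta.name)        # the tag name, not the HTML "name" attribute
print(soup.meta['charset'])  # an attribute of that first <meta>
print(soup.table)            # None, because this document has no <table>
```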

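Finding elements
Use find() to look up a single element; more often we use find_all(), which returns every match: soup.find_all(name, attrs, recursive, text, limit)
name: the tag name (or list of tag names) to search for
attrs: attribute values to search for; specific attributes can be named
recursive: whether to search all descendants rather than only direct children; defaults to True
text: a string to search for in the text between <> and </>; since Beautiful Soup 4.4.0 this argument is called string, and text is the older name
limit: the maximum number of results to return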
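
The parameters above can be sketched one at a time on a made-up snippet (the markup and the variable names are assumptions for illustration). The example uses the newer string argument, which assumes Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<div id="top">
  <a href="/a" class="nav">Home</a>
  <p><a href="/b" class="nav">World</a></p>
  <a href="/c">Sports</a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
top = soup.find('div', attrs={'id': 'top'})

# name: match by tag name
all_links = top.find_all('a')
# attrs: match by attribute value
nav_links = top.find_all('a', attrs={'class': 'nav'})
# recursive=False: only direct children of the div
direct = top.find_all('a', recursive=False)
# string: match on the tag's text
by_text = top.find_all('a', string='Sports')
# limit: stop after two results
first_two = top.find_all('a', limit=2)

print(len(all_links), len(nav_links), len(direct), len(by_text), len(first_two))
```

With recursive=False the link nested inside the <p> is skipped, which is why direct contains fewer results than all_links.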

# Title text of the HTML file
title = soup.find('title').text
# Print all <meta> tags
print(soup.find_all('meta'))
# Print every tag whose name appears in the list
print(soup.find_all(['meta', 'a']))
for a in soup.find_all('a'):
    # Print the link URL and the link text
    print(a.get('href'))
    print(a.text)
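
In the loop above, a.get('href') returns None for any <a> tag that has no href, and relative links are printed as-is. The sketch below is one possible way to tidy that up, assuming made-up markup and an assumed base_url for resolving relative paths.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup; base_url is an assumed page address for resolving links
base_url = 'http://news.baidu.com/'
html = """
<a href="/guonei">Domestic</a>
<a name="anchor-only">No link here</a>
<a href="http://example.com/x">  External  </a>
"""
soup = BeautifulSoup(html, 'html.parser')

links = []
for a in soup.find_all('a', href=True):   # href=True skips <a> tags without href
    url = urljoin(base_url, a['href'])    # turn relative paths into full URLs
    title = a.get_text(strip=True)        # text with surrounding whitespace removed
    links.append((url, title))

for url, title in links:
    print(url, title)
```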

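Finding by attribute

print('class:', soup.find_all(class_='a3'))
print('mon:', soup.find_all(mon="ct=1&a=1&c=top&pn=0"))

Note: these calls search by a specific attribute. The first finds tags whose class attribute is a3. Because class is a Python keyword, the argument needs a trailing underscore: class_.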
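
As a self-contained illustration of attribute searches, the sketch below uses made-up markup (the mon and data-id values are assumptions). It also shows the attrs dictionary form, which is needed for attribute names such as data-id that are not valid Python keyword arguments.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<a class="a3" mon="top" href="/1">First</a>
<a class="a3 hot" href="/2">Second</a>
<a class="b1" data-id="7" href="/3">Third</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any tag that carries the class a3, even among several classes
by_class = soup.find_all(class_='a3')
# Equivalent form using an attrs dictionary
same = soup.find_all(attrs={'class': 'a3'})
# Other attributes can be passed directly as keyword arguments
by_mon = soup.find_all(mon='top')
# Hyphenated attribute names cannot be keywords, so use attrs
by_data = soup.find_all(attrs={'data-id': '7'})

print([a.text for a in by_class])
print([a.text for a in by_mon])
print([a.text for a in by_data])
```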
