Web Scraping: Notes on the BeautifulSoup Library


doudoudedi


Article link: https://blog.csdn.net/qq_37433000/article/details/93870512


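I've been pretty tired these past few days~~
So I'll just share a few notes on using the BeautifulSoup library.
I often run into encoding problems when writing crawlers, which is why I wrote this post.
BeautifulSoup: "beautiful soup, so rich and green".
It is a flexible and convenient HTML parsing library that works efficiently and supports several parsers.
Here is an example (borrowed from someone else):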

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
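soup = BeautifulSoup(html, 'lxml')  # needs the lxml package; 'html.parser' is the built-in alternative
print(soup.prettify())         # the whole document, re-indented
print(soup.title)              # <title>The Dormouse's story</title>
print(soup.title.name)         # title
print(soup.title.string)       # The Dormouse's story
print(soup.title.parent.name)  # head
print(soup.p)                  # the first <p> tag
print(soup.p["class"])         # ['title']
print(soup.a)                  # the first <a> tag
print(soup.find_all('a'))      # a list of all three <a> tags
print(soup.find(id='link3'))   # the <a> tag whose id is link3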

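The find functions are very handy: you can search for tags by id attribute, by class attribute, with a dictionary of attributes, with a regular expression, and so on. find_all returns its matches as a list.
find_all(name, attrs, recursive, string, limit, **kwargs)
# find_all(tag name, attributes, recurse into descendants, text content, maximum number of results, keyword attribute filters)
attrs is a dictionary mapping attribute names to the values to match. The string argument was called text in older versions.
find(...) is equivalent to find_all(..., limit=1), except that it returns the single matching tag (or None if nothing matches) instead of a list.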
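The search styles listed above can be tried on a shortened copy of the earlier HTML. This is only a minimal sketch: the variable names are made up for illustration, and it uses the built-in 'html.parser' so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the sample document used earlier in this post.
html = '''
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# Search by id attribute (keyword argument).
by_id = soup.find(id='link2')

# Search by class; the keyword is class_ because class is reserved in Python.
by_class = soup.find_all('a', class_='sister')

# Search with a dictionary of attributes; every entry must match.
by_dict = soup.find_all('a', {'class': 'sister', 'id': 'link3'})

# Search with a regular expression applied to an attribute value.
by_regex = soup.find_all(href=re.compile(r'/(elsie|lacie)$'))

# limit caps the number of results; find returns the tag itself, not a list.
limited = soup.find_all('a', limit=1)
first = soup.find('a')
missing = soup.find('table')  # no <table> in the document

print(by_id.get_text())
print([a['id'] for a in by_class])
print([a.get_text() for a in by_regex])
print(limited[0] == first, missing)
```

Because find returns None when nothing matches, it is worth checking the result before calling methods such as get_text() on it.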
So here is a simple example I found online:

from urllib.request import urlopen
from bs4 import BeautifulSoup


url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(url)  # fetch the page at this URL
soup = BeautifulSoup(html, 'html.parser')  # parse the page; naming the parser explicitly avoids a warning
name_list = soup.find_all("span", {'class': 'green'})  # find_all collects every green-text span and returns them as a list
for name in name_list:
    print(name.get_text())  # get_text() strips the tags and keeps only the text inside them
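Since encoding problems were the reason for writing this post, here is a rough sketch of one way to handle them. It assumes a hypothetical page whose bytes are GBK-encoded and carry no charset declaration, and it passes the encoding to BeautifulSoup through the from_encoding argument. The sample bytes and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: raw bytes encoded as GBK, with no <meta charset> hint.
raw = '<html><head><title>爬虫测试</title></head><body><p>你好</p></body></html>'.encode('gbk')

# Option 1: tell BeautifulSoup which encoding the bytes use.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')
title_text = soup.title.string
print(title_text)
print(soup.original_encoding)  # the encoding BeautifulSoup ended up using

# Option 2: decode the bytes yourself and pass a str instead.
decoded_soup = BeautifulSoup(raw.decode('gbk'), 'html.parser')
paragraph_text = decoded_soup.p.get_text()
print(paragraph_text)
```

If the encoding is not given, BeautifulSoup tries to detect it, and that guess can be wrong for short or undeclared pages, so stating it explicitly is usually the safer choice when you know it.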
