Notes on Python Web Scraping with BeautifulSoup4

Introduction to Beautiful Soup

Beautiful Soup is a library for extracting data from HTML and XML documents. It automatically converts input documents to Unicode and output documents to UTF-8, so you normally don't need to think about encodings at all — unless the document doesn't declare an encoding and Beautiful Soup fails to detect one automatically.
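When you feed Beautiful Soup raw bytes, the encoding it settled on is recorded on `original_encoding`. A minimal sketch (assuming `lxml` is installed); the markup and names here are made up for the demo:

```python
from bs4 import BeautifulSoup

# The charset is declared in the markup, so Beautiful Soup can pick it
# up while decoding the bytes and records it on original_encoding.
raw = b'<html><head><meta charset="utf-8"></head><body><p>hello</p></body></html>'
soup = BeautifulSoup(raw, 'lxml')

print(soup.original_encoding)  # encoding detected for the input bytes
print(soup.p.text)             # contents are already decoded to str

# When detection fails, you can force the input encoding yourself:
# BeautifulSoup(raw, 'lxml', from_encoding='gb18030')
```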

Installing BeautifulSoup

pip install lxml
pip install bs4

First look at bs4

HTML example

html_str = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_str,'lxml')

#print(soup)
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p>
# </body></html>

print(soup.prettify()) # pretty-print the parse tree
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ;
# and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

#!/usr/bin/env python
# coding=utf-8
#@Time           : 2020/5/21 16:25
#@Author         : GodSpeed
#@File           : bs4_study.py
#@Software       : PyCharm

import requests
from bs4 import BeautifulSoup
import csv

# w_data='''<div class="article-content"><p>成语接龙是中华民族传统的文字游戏。它历史悠久,是传统文字、文化、文明的一个缩影,也是老少皆宜的民间文化娱乐活动。</p><p><br></p><p>今天特此精选2000个成语,“胸有成竹”作为开始,也作为结束,尽情感受汉字之美!</p><p><br></p><div class="pgc-img"><img src="http://p1.pstatp.com/large/pgc-image/4ae0e1096c67466daf97c315368eeaee" img_width="1000" img_height="682" alt="2000个成语,头尾接龙,妙绝了!(值得收藏)" inline="0"><p class="pgc-img-caption"></p></div><p><br></p><p><strong>胸有成竹</strong> + 竹报平安 + 安富尊荣</p><p>荣华富贵 + 贵而贱目 + 目无余子</p><p>子虚乌有 + 有目共睹 + 睹物思人</p><p>人中骐骥 + 骥子龙文 + 文质彬彬</p><p>彬彬有礼 + 礼贤下士 + 士饱马腾</p><p>腾云驾雾 + 雾里看花 + 花言巧语</p><p>语重心长 + 长此以往 + 往返徒劳</p><p>劳而无功 + 功成不居 + 居官守法</p><p>法外施仁 + 仁浆义粟 + 粟红贯朽</p><p>朽木死灰 + 灰飞烟灭 + 灭绝人性</p><p>性命交关 + 关门大吉 + 吉祥止止</p><p>止于至善 + 善贾而沽 + 沽名钓誉</p><p>誉不绝口 + 口蜜腹剑 + 剑戟森森</p><p>森罗万象 + 象箸玉杯 + 杯弓蛇影</p><p>影影绰绰 + 绰约多姿 + 姿意妄为</p><p>为人作嫁 + 嫁祸于人 + 人情冷暖</p><p>暖衣饱食 + 食不果腹 + 腹背之毛</p><p>毛手毛脚 + 脚踏实地 + 地老天荒</p><p>荒诞不经 + 经纬万端 + 端倪可察</p><p>察言观色 + 色若死灰 + 灰头土面</p><p>面有菜色 + 色授魂与 + 与民更始</p><p>始乱终弃 + 弃瑕录用 + 用舍行藏</p><p>藏垢纳污 + 污泥浊水 + 水乳交融</p><p>融会贯通 + 通宵达旦 + 旦种暮成</p><p>成人之美 + 美人迟暮 + 暮云春树</p><p>树大招风 + 风中之烛 + 烛照数计</p><p><br></p><div class="pgc-img"><img src="http://p1.pstatp.com/large/pgc-image/960251045b554af182214e0747214c68" img_width="900" img_height="500" alt="2000个成语,头尾接龙,妙绝了!(值得收藏)" inline="0"><p class="pgc-img-caption"></p></div><p><br></p><p>徙宅忘妻 + 妻儿老小 + 小本经营</p><p>营私舞弊 + 弊绝风清 + 清尘浊水</p><p>水磨工夫 + 夫唱妇随 + 随才器使</p><p>使贪使愚 + 愚昧无知 + 知书达礼</p><p>礼尚往来 + 来者不拒 + 拒谏饰非</p><p>非异人任 + 任人唯亲 + 亲密无间</p><p>间不容发 + 发指眦裂 + 裂土分茅</p><p>茅塞顿开 + 开路先锋 + 锋芒所向</p><p>向隅而泣 + 泣下如雨 + 雨丝风片</p><p>片言折狱 + 狱货非宝 + 宝山空回</p><p>回光返照 + 照本宣科 + 科班出身</p><p>身价百倍 + 倍日并行 + 行动坐卧</p><p>卧薪尝胆 + 胆破心寒 + 寒木春华</p><p>华不再扬 + 扬长而去 + 去粗取精</p><p>精诚团结 + 结党营私 + 私心杂念</p><p>念兹在兹 + 兹事体大 + 大势所趋</p><p>趋炎附势 + 势不两立 + 立此存照</p><p>照猫画虎 + 虎背熊腰 + 腰缠万贯</p><p>贯朽粟陈 + 陈词滥调 + 调嘴学舌</p><p>舌剑唇枪 + 枪林弹雨 + 雨过天青</p><p>青出于蓝 + 蓝田生玉 + 玉卮无当</p><p>当场出彩 + 彩凤随鸦 + 鸦雀无闻</p><p>闻风而起 + 起死回生 + 生拉硬扯</p><p>扯篷拉纤 + 纤芥之疾 + 疾风迅雷</p><p>雷打不动 + 动辄得咎 + 咎由自取</p><p>取辖投井 + 井井有条 + 条三窝四</p><p>四衢八街 + 街头巷尾 + 尾生之信</p><p>信口开河 + 河山带砺 + 
砺山带河</p><p>河清难俟 + 俟河之清 + 清汤寡水</p><p>水滴石穿 + 穿云裂石 + 石沉大海</p><p>海立云垂 + 垂涎欲滴 + 滴水成冰</p><p>冰清玉洁 + 洁身自好 + 好肉剜疮</p><p>疮痍满目 + 目不识丁 + 丁公凿井</p><p>井中视星 + 星旗电戟 + 戟指怒目</p><p>目指气使 + 使羊将狼 + 狼心狗肺</p><p>肺石风清 + 清夜扪心 + 心织笔耕</p><p>耕当问奴 + 奴颜婢膝 + 膝痒搔背</p><p>背信弃义 + 义无反顾 + 顾全大局</p><p>局促不安 + 安步当车 + 车载斗量</p><p>量才而为 + 为渊驱鱼 + 鱼游釜中</p><p>中馈犹虚 + 虚有其表 + 表里如一</p><p>一呼百诺 + 诺诺连声 + 声罪致讨</p><p><br></p><div class="pgc-img"><img src="http://p1.pstatp.com/large/pgc-image/95303955b2ba42e9b0cc5b575780c857" img_width="900" img_height="500" alt="2000个成语,头尾接龙,妙绝了!(值得收藏)" inline="0"><p class="pgc-img-caption"></p></div><p><br></p><p>讨价还价 + 价增一顾 + 顾盼自雄</p><p>雄心壮志 + 志美行厉 + 厉兵秣马</p><p>马工枚速 + 速战速决 + 决一雌雄</p><p>雄才大略 + 略见一斑 + 斑驳陆离</p><p>离弦走板 + 板上钉钉 + 钉嘴铁舌</p><p>舌桥不下 + 下马看花 + 花样翻新</p><p>新陈代谢 + 谢天谢地 + 地久天长</p><p>长枕大被 + 被山带河 + 河落海干</p><p>干柴烈火 + 火上浇油 + 油腔滑调</p><p>调兵遣将 + 将伯之助 + 助人为乐</p><p>乐而不淫 + 淫词艳曲 + 曲终奏雅</p><p>雅俗共赏 + 赏罚分明 + 明刑不戮</p><p>戮力同心 + 心心相印 + 印累绶若</p><p>若有所失 + 失张失智 + 智圆行方</p><p>方枘圆凿 + 凿凿有据 + 据为己有</p><p>有眼无珠 + 珠光宝气 + 气味相投</p><p>投鼠忌器 + 器宇轩昂 + 昂首阔步</p><p>步履维艰 + 艰苦卓绝 + 绝少分甘</p><p>甘雨随车 + 车水马龙 + 龙飞凤舞</p><p>舞衫歌扇 + 扇枕温被 + 被发缨冠</p><p>冠冕堂皇 + 皇天后土 + 土阶茅屋</p><p>屋乌之爱 + 爱莫能助 + 助我张目</p><p>目挑心招 + 招风惹草 + 草率收兵</p><p>兵不雪刃 + 刃迎缕解 + 解衣推食</p><p>食古不化 + 化零为整 + 整装待发</p><p>发凡起例 + 例行公事 + 事必躬亲</p><p>亲如骨肉 + 肉跳心惊 + 惊弓之鸟</p><p>鸟枪换炮 + 炮凤烹龙 + 龙蛇飞动</p><p>动人心弦 + 弦外之音 + 音容笑貌</p><p>貌合心离 + 离心离德 + 德高望重</p><p>重蹈覆辙 + 辙乱旗靡 + 靡靡之音</p><p>音容宛在 + 在所难免 + 免开尊口</p><p>口耳之学 + 学而不厌 + 厌难折冲</p><p>冲口而出 + 出谷迁乔 + 乔龙画虎</p><p>虎踞龙盘 + 盘马弯弓 + 弓折刀尽</p><p>尽善尽美 + 美意延年 + 年高望重</p><p>重温旧梦 + 梦寐以求 + 求全之毁</p><p>毁家纾难 + 难言之隐 + 隐恶扬善</p><p>善始善终 + 终南捷径 + 径情直行</p><p>行成于思 + 思潮起伏 + 伏低做小</p><p>小恩小惠 + 惠而不费 + 费尽心机</p><p>机关算尽 + 尽忠报国 + 国士无双</p><p>双宿双飞 + 飞灾横祸 + 祸从天降</p><p>降格以求 + 求同存异 + 异名同实</p><p>实至名归 + 归真反璞 + 璞玉浑金</p><p>金玉锦绣 + 绣花枕头 + 头没杯案</p><p>案牍劳形 + 形单影只 + 只字不提</p><p>提心吊胆 + 胆大心细 + 细枝末节</p><p>节用裕民 + 民脂民膏 + 膏唇试舌</p><p>舌锋如火 + 火伞高张 + 张冠李戴</p><p>戴月披星 + 星移斗转 + 转祸为福</p><p>福至心灵 + 灵丹圣药 + 药笼中物</p><p>物以类聚 + 聚蚊成雷 + 雷厉风行</p><p>行将就木 + 木本水源 + 源源不断</p><p><br></div>
# '''
# soup = BeautifulSoup(w_data,features='lxml')
# print(soup)
# # SyntaxError: Non-UTF-8 code starting with '\xef' -- fixed by putting "# coding=utf-8" at the top of the file
# print(soup.prettify()) # pretty-print the parse tree

# Navigating with soup

#print(type(soup.title)) # <class 'bs4.element.Tag'>
#print(soup.title) #<title>The Dormouse's story</title>
#print(soup.head.title) #<title>The Dormouse's story</title>
#print(soup.p) #<p class="title"><b>The Dormouse's story</b></p>

# tag name
#print(soup.p.name) #p
# tag contents
#print(soup.p.string) #The Dormouse's story
#print(soup.p.text) #The Dormouse's story

# Find all matching tags

r = soup.find_all('p')
#print(type(r)) # <class 'bs4.element.ResultSet'>
#print(len(r)) # 3
for p_tag in r:
    #print(type(p_tag)) #<class 'bs4.element.Tag'>
    print(p_tag)

# Get the href attribute of each a tag
a_hrefs = soup.find('p',class_='story').find_all('a')
for a_href in a_hrefs:
    #print(a_href['href'])
    print(a_href.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
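Attribute access has a couple more conveniences worth knowing: `.attrs` exposes every attribute as a dict, and `.get()` accepts a default, so a missing attribute does not raise `KeyError`. A small self-contained sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    'lxml')
a = soup.a

print(a.attrs)                # every attribute as a dict
print(a['href'])              # indexing raises KeyError if the attribute is missing
print(a.get('title', 'n/a'))  # .get() falls back to a default instead
```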

bs4 object types

Tag : a markup tag
NavigableString : a navigable string
BeautifulSoup : the soup object itself
Comment : an HTML comment

soup = BeautifulSoup(html_str,'lxml')
# BeautifulSoup object
#print(type(soup)) #<class 'bs4.BeautifulSoup'>


# Tag object
#print(type(soup.title)) #<class 'bs4.element.Tag'>

#print(soup.p.string) #The Dormouse's story
# NavigableString: a string you navigated to from a tag
#print(type(soup.p.string)) #<class 'bs4.element.NavigableString'>

# Comment
# an HTML comment
# <!--这是单行注释-->
html_comment = '<s><!--这是单行注释--></s>'
soup_comment = BeautifulSoup(html_comment,'lxml')
print(soup_comment.s.string) #这是单行注释

print(type(soup_comment.s.string)) #<class 'bs4.element.Comment'>
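Because `Comment` subclasses `NavigableString`, comment bodies show up alongside real text when you walk a tag's strings, which is easy to miss. Filtering with `isinstance` keeps only genuine text — a small sketch:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup('<p>visible<!-- hidden note --></p>', 'lxml')

# drop Comment nodes explicitly so only real text remains
texts = [s for s in soup.p.strings if not isinstance(s, Comment)]
print(texts)
```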

Traversing the tree: child nodes

bs4 supports three kinds of operations: traversal, searching, and modification.

contents, children, descendants

  • contents returns a list
  • children returns an iterator that you can loop over
  • descendants returns a generator that walks every descendant, recursively
from bs4 import BeautifulSoup

html_str = """
<html><head><title>head The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_str,'lxml')
#tag
# print(soup.p) #<p class="title"><b>The Dormouse's story</b></p>
# print(soup.a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# print(soup.b) # <b>The Dormouse's story</b>

# Find all p tags
# p_tag_all =  soup.find_all('p')
# print(type(p_tag_all)) # <class 'bs4.element.ResultSet'>
# print(p_tag_all)
#[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>, <p class="story">...</p>]

# Use [] to read a tag attribute
p_class = soup.p['class']

# print(type(p_class)) # <class 'list'>
# print(p_class) #['title']
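`class` comes back as a list because HTML defines it as a multi-valued attribute; single-valued attributes such as `id` stay plain strings. A quick sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title main" id="intro">hi</p>', 'lxml')

print(soup.p['class'])  # multi-valued attribute -> list of classes
print(soup.p['id'])     # single-valued attribute -> plain string
```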

# - descendants
# returns a generator that walks every descendant, recursively

for des in soup.descendants:
    print('----------------------------')
    print(des)


html_str ='''
<html>

<body>

<h4>一个无序列表:</h4>
<ul>
  <li>咖啡</li>
  <li>茶</li>
  <li>牛奶</li>
  <li>饮料</li>
  <li>水果</li>
</ul>

</body>
</html>
'''

# - contents
# returns a list
soup2 = BeautifulSoup(html_str,'lxml')
#print(type(soup2.contents)) #<class 'list'>
#print(soup2.contents)

# - children
# returns an iterator that you can loop over
#print(soup2.ul.children) #<list_iterator object at 0x000001A536D253A0>

# for li in soup2.ul.children:
#     #print(type(li))
#     print(li)
#     #print(li.string)

.string, .strings, .stripped_strings

.string gets the contents of a single tag


from bs4 import BeautifulSoup

html_str = """
<html><head><title>head The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_str,'lxml')

However, .string cannot fetch content that spans multiple child tags; when a tag has more than one child it returns None:

print(soup.find('html').string) #None

.strings returns a generator, used to fetch the contents of multiple tags

html_strings = soup.find('html').strings

print(type(html_strings)) #<class 'generator'>

for html_string in html_strings:
    #print(type(html_string)) #<class 'bs4.element.NavigableString'>
    print(html_string) # prints every descendant's text, whitespace included

.stripped_strings works just like .strings but removes the extra whitespace

html_stripped_strings = soup.find('html').stripped_strings

print(type(html_stripped_strings)) #<class 'generator'>

for html_stripped_string in html_stripped_strings:
    #print(type(html_stripped_string)) #<class 'str'>
    print(html_stripped_string) # every descendant's text, extra whitespace removed
    #head The Dormouse's story
    # The Dormouse's story
    # Once upon a time there were three little sisters; and their names were
    # Elsie
    # ,
    # Lacie
    # and
    # Tillie
    # ;
    # and they lived at the bottom of a well.
    # ...
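The same loop can usually be collapsed into a single call: `get_text()` joins all descendant strings, and with `strip=True` plus a separator it behaves much like iterating `stripped_strings`. A sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>  Once upon a time  <a>Elsie</a> , </p>', 'lxml')

# strip each fragment, drop the empty ones, then join with the separator
text = soup.p.get_text('|', strip=True)
print(text)
```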

Traversing the tree: parent nodes

parent and parents

  • parent returns the tag's direct parent node

from bs4 import BeautifulSoup

html_str = """
<html><head><title>head The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_str,'lxml')

a_tag = soup.find('a')
#print(a_tag) #<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(a_tag.parent)
#<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
  • parents yields all ancestor nodes
#print(type(a_tag.parents)) #<class 'generator'>

for a_parent in a_tag.parents:
    print(type(a_parent))
    print(a_parent)

Traversing the tree: sibling nodes

next_sibling : the next sibling node
previous_sibling : the previous sibling node
next_siblings : all following siblings
previous_siblings : all preceding siblings

import re
html_str = '''
<bookstore>
    <book1>
        <Index>1</Index>
        <title lang="中文">指环王</title>
        <author>约翰·罗纳德·瑞尔·托尔金</author>
        <year>1954年至1955年出版</year>
        <price>60.00</price>
    </book1>
    <book2>
        <Index>2</Index>
        <title lang="中文">哈利波特</title>
        <author>J.K.罗琳</author>
        <year>1997~2007</year>
        <price>120.00</price>
    </book2>
    <book3>
        <Index>3</Index>
        <title lang="中文">复仇者联盟</title>
        <author>美国漫威漫画旗下超级英雄团队</author>
        <year>1963年9月</year>
        <price>150.00</price>
    </book3>
</bookstore>
'''

# The whitespace between tags must be stripped first, otherwise every
# newline/indent becomes a text-node sibling and next_sibling returns
# that instead of the next tag. Strip it only between tags, so that
# attributes such as lang="中文" survive intact.
html_str = re.sub(r'>\s+<', '><', html_str)

soup = BeautifulSoup(html_str,'lxml')
#print(soup.prettify())
book1_tag = soup.book1
#print(book1_tag)
#<book1><index>1</index><title lang="中文">指环王</title><author>约翰·罗纳德·瑞尔·托尔金</author><year>1954年至1955年出版</year><price>60.00</price></book1>


# next_sibling: the next sibling node
#print(book1_tag.next_sibling)
#<book2><index>2</index><title lang="中文">哈利波特</title><author>J.K.罗琳</author><year>1997~2007</year><price>120.00</price></book2>


# previous_sibling: the previous sibling node
#print(book1_tag.previous_sibling)
#None

book3_tag = soup.book3
#print(book3_tag.previous_sibling)
#<book2><index>2</index><title lang="中文">哈利波特</title><author>J.K.罗琳</author><year>1997~2007</year><price>120.00</price></book2>


# next_siblings: all following siblings
#print(book1_tag.next_siblings) #<generator object PageElement.next_siblings at 0x000001FD28F77350>

#for sibling in book1_tag.next_siblings:
    #print(sibling)
    #<book2><index>2</index><title lang="中文">哈利波特</title><author>J.K.罗琳</author><year>1997~2007</year><price>120.00</price></book2>
    #<book3><index>3</index><title lang="中文">复仇者联盟</title><author>美国漫威漫画旗下超级英雄团队</author><year>1963年9月</year><price>150.00</price></book3>


# previous_siblings: all preceding siblings
for sibling in book3_tag.previous_siblings:
    print(sibling)
    #<book2><index>2</index><title lang="中文">哈利波特</title><author>J.K.罗琳</author><year>1997~2007</year><price>120.00</price></book2>
    #<book1><index>1</index><title lang="中文">指环王</title><author>约翰·罗纳德·瑞尔·托尔金</author><year>1954年至1955年出版</year><price>60.00</price></book1>

The find functions

  • find_all() returns every matching tag in a list-like ResultSet
  • find() returns only the first match
html_str = """
<html><head><title>head The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_str,'lxml')

# find_all(self, name=None, attrs={}, recursive=True, text=None,
#                  limit=None, **kwargs)

# name : tag name
# attrs : tag attributes
# recursive : whether to search recursively
# text : text content to match
# limit : maximum number of results to return
# **kwargs : extra keyword arguments (attribute filters)

tag_p = soup.find_all('p',class_='title')
#print(tag_p)
#[<p class="title"><b>The Dormouse's story</b></p>]
tag_p_s = soup.find_all('p',class_='story',recursive=True)
#print(tag_p_s)
#[<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>, <p class="story">...</p>]
tag_p_s = soup.find_all('p',class_='story',recursive=False)
#print(tag_p_s)
#[]
tag_p_s = soup.find_all('p',class_='story',recursive=True,text='...')
#print(tag_p_s)
#[<p class="story">...</p>]

tag_a = soup.find_all('a',limit=1)[0]
#print(tag_a)
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
#print(soup.a)
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find('a'))
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
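Besides find/find_all, bs4 also supports CSS selectors through `select()` and `select_one()` (backed by the soupsieve package), which can often replace a chain of find calls. A sketch on a fragment like the one above:

```python
from bs4 import BeautifulSoup

html = '''<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html, 'lxml')

# tag.class, parent > child, and #id selectors all work
links = soup.select('p.story > a.sister')
print([a.get('href') for a in links])

lacie = soup.select_one('#link2')
print(lacie.text)
```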

  • find_parents() searches all ancestor nodes
  • find_parent() searches for a single ancestor node
  • find_next_siblings() searches all following siblings
  • find_next_sibling() searches for a single following sibling
#print(soup.title.find_parent('head'))
#<head><title>head The Dormouse's story</title></head>
#print(soup.find(text='Lacie').find_parents('p'))

#print(soup.find(text='Lacie').find_parent('p'))
#<p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>

#print(soup.find(text='Lacie').find_next_sibling('a'))
#None
#print(soup.a.find_next_sibling('a'))
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

for a_sibling in soup.a.find_next_siblings('a'):
    print(a_sibling)
    #<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    #<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

  • find_previous_siblings() searches backwards for all preceding siblings
  • find_previous_sibling() searches backwards for a single preceding sibling
tag_a_t = soup.find('a', text='Tillie')

for tag_a_t_p_s in tag_a_t.find_previous_siblings():
    print(tag_a_t_p_s)
    #<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    #<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
tag_a = soup.find('a', text='Lacie')
print(tag_a.find_previous_sibling())
#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

  • find_all_next() searches forwards for all elements
  • find_next() searches forwards for a single element
tag_a = soup.find('a')
#print(tag_a.find_all_next())
#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

print(tag_a.find_next())
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

Modifying the document tree

  • Modify a tag's name and attributes
  • Assign to .string: the new value replaces the tag's previous contents
  • append() adds content to a tag, just like Python's list.append()
  • decompose() removes a tag entirely, useful for deleting unwanted sections
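The four operations above can be sketched together like this (the markup and the `junk` section are made up for the demo):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title"><b>old</b></p><p class="junk">ads</p>', 'lxml')

tag = soup.b
tag.name = 'strong'        # change the tag's name
tag['class'] = 'headline'  # change/add an attribute
tag.string = 'new text'    # assigning .string replaces the old contents

link = soup.new_tag('a', href='http://example.com')
link.string = 'more'
soup.p.append(link)        # append() works like list.append()

soup.find('p', class_='junk').decompose()  # delete an unwanted section

print(soup.prettify())
```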

Exercise

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#@Time           : 2020/5/26 9:10
#@Author         : GodSpeed
#@File           : 爬取中国天气网.py
#@Software       : PyCharm

from bs4 import BeautifulSoup
import requests
import html5lib  # only needs to be installed; bs4 loads the parser by name
# Task: scrape the date, each city, and the daily high and low from the China Weather site

# North China (华北)
# http://www.weather.com.cn/textFC/hb.shtml

# Northeast China (东北)
# http://www.weather.com.cn/textFC/db.shtml

# East China (华东)
# http://www.weather.com.cn/textFC/hd.shtml

# Central China (华中)
# http://www.weather.com.cn/textFC/hz.shtml

# South China (华南)
# http://www.weather.com.cn/textFC/hn.shtml

# Northwest China (西北)
# http://www.weather.com.cn/textFC/xb.shtml

# Southwest China (西南)
# http://www.weather.com.cn/textFC/xn.shtml

# Hong Kong, Macao and Taiwan (港澳台)
# http://www.weather.com.cn/textFC/gat.shtml

url = 'http://www.weather.com.cn/textFC/{}.shtml'
area_list =['hb','db','hd','hz','hn','xb','xn','gat']

def get_data(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6788.400 QQBrowser/10.3.2816.400'
    }
    req = requests.get(url,headers=headers)
    req.encoding = 'utf-8'
    #print(req.text)
    #soup = BeautifulSoup(req.text,'lxml')
    #soup = BeautifulSoup(req.text,'html.parser')
    # html5lib is the most lenient parser; with the stricter parsers the
    # page's malformed markup comes back with parts of the tables missing
    soup = BeautifulSoup(req.text, 'html5lib')
    # 1. Grab the container div that holds all of today's tables
    conMidtab = soup.find('div',class_="conMidtab")
    #print(type(conMidtab)) #<class 'bs4.element.Tag'>
    # Get the date
    date = conMidtab.find_all('td')[2]
    print(date.string[:-2])

    # 2. Find the table for each province
    province_tables = conMidtab.find_all('table')

    for province_table in province_tables:

        # 3. Skip the first two header rows (tr), then extract each city's data
        #print(province_table.find_all('tr')[2:])
        cities_info = province_table.find_all('tr')[2:]
        list_city_info = []
        for city_info in cities_info:
            #print(city_info)
            # city name
            #city_name = city_info.find('td',height='23').text
            #city_name = city_info.find_all('td')[-8].text.replace('\n','')
            city_name = list(city_info.find('td').stripped_strings)[0]
            #print(city_name)
            # daily high
            #ht = city_info.find('td',width="92").text
            ht = city_info.find_all('td')[-5].text
            #print(ht)
            # daily low
            lt = city_info.find_all('td')[-2].text
            #print(lt)
            list_city_info.append([city_name,ht,lt])

            #break
        print(list_city_info)





if __name__ == '__main__':

    for area in area_list:
        url_html = url.format(area)
        #print(url_html)
        get_data(url_html)
        #break
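The exercise only prints each province's rows; persisting them is the natural last step (and would put the earlier unused `import csv` to work). A minimal sketch with made-up rows and a hypothetical `weather.csv` filename:

```python
import csv

# rows in the [city, high, low] shape that get_data() builds
rows = [['北京', '28', '17'], ['上海', '26', '19']]

with open('weather.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['city', 'high', 'low'])  # header line
    writer.writerows(rows)                    # one line per city
```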

