beautifulsoup html content: Web Scraping with Python, chapter 2: Advanced HTML Parsing

Latest recommended article published on 2022-09-27 16:37:48

weixin_39901213


Article tags: beautifulsoup html content, all html attributes, python html parsing


from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(html.read(), "html.parser")
# Collect every <span class="green"> on the page, in document order
namelist = bs.find_all("span", {"class": "green"})
for name in namelist:
    print(name.get_text())  # text only, without the surrounding tag

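bs.find_all("span",class_="green") finds, in document order, every element whose tag name (tagName) is span and whose class attribute (tagAttribute) is green. It is equivalent to the dictionary form {'class':'green'} used in the code above.

name.get_text() separates the content from its tags; printing name directly would include the tag markup. It is generally the last step, used only once the tag structure is no longer needed.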
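The difference between printing a tag and printing its text can be seen in a minimal sketch. The HTML string below is made up for illustration and is not taken from the warandpeace page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a downloaded page
html = '<p>Hello, <span class="green">Anna <b>Pavlovna</b></span>!</p>'
bs = BeautifulSoup(html, "html.parser")

span = bs.find("span", class_="green")
tag_markup = str(span)        # the tag itself, markup included
plain_text = span.get_text()  # nested tags stripped, text kept

print(tag_markup)
print(plain_text)
```

Because get_text() discards the tag structure, it is usually safer to keep working with Tag objects and call it only when producing the final output.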

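find_all accepts several parameters: find_all(tag, attributes, recursive, text, limit, keywords). In the bs4 signature these are named name, attrs, recursive, string, limit and **kwargs.

tag: a tag name or a list of tag names, e.g. .find_all(["h1","h2","h3"]).

attributes: a dictionary of attribute values, e.g. .find_all("span",{"class":["green","red"]}), which matches spans whose class is green or red.

recursive: if True, the search descends through all levels of descendants; if False, only the direct children of the tag are examined. The default is True.

text: matches on text content instead of tag properties, which can be used to count how often a piece of text occurs. Recent bs4 versions name this argument string.

limit: caps the number of results returned; .find() is equivalent to find_all with limit set to 1.

keywords: selects tags by the value of a named attribute, e.g. class_="exa" matches tags that have exa among their classes (class_ is used because class is a reserved word in Python); bs.find_all(id='text') and bs.find_all('', {'id':'text'}) are equivalent.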
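The parameters listed above can be tried out on a small inline document. This is only a sketch: the HTML and the variable names are invented for illustration, and the string argument is the bs4 name for what the list calls text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not one of the pythonscraping.com pages
html = """
<div id="text">
  <h1>Title</h1>
  <h2>Subtitle</h2>
  <span class="green">the prince</span>
  <span class="red">Anna</span>
  <span class="green big">the prince</span>
  <p><span class="blue">nested</span></p>
</div>
"""
bs = BeautifulSoup(html, "html.parser")
div = bs.find(id="text")

# tag: a list of names matches any of them
headings = bs.find_all(["h1", "h2", "h3"])
# attributes: a list of values matches any of them
colored = bs.find_all("span", {"class": ["green", "red"]})
# recursive=False: only direct children of div, so the nested span is skipped
direct_spans = div.find_all("span", recursive=False)
# string: match on text content, here used to count occurrences
prince_count = len(bs.find_all(string="the prince"))
# limit: stop after the first two matches
first_two = bs.find_all("span", limit=2)
# keyword: class_ matches any single class of a multi-class tag
green = bs.find_all("span", class_="green")

print(len(headings), len(colored), len(direct_spans),
      prince_count, len(first_two), len(green))
```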

The four kinds of BeautifulSoup objects:

BeautifulSoup object: the bs used above
Tag object: e.g. bs.div
NavigableString object
Comment object
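As a rough illustration, all four object types can be obtained from a single small document. The HTML below is a made-up example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical one-line document containing text and a comment
html = "<div>plain text<!-- a comment --></div>"
bs = BeautifulSoup(html, "html.parser")

div = bs.div                            # Tag object
text_node, comment_node = div.contents  # NavigableString and Comment

for obj in (bs, div, text_node, comment_node):
    print(type(obj).__name__)
```

Comment is a subclass of NavigableString, so code that must tell them apart should check for Comment first.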

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for child in bs.find("table",{"id":"giftList"}).children:
    print(child)

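.children yields only the direct children of a tag, that is, the first level of its descendants. Searches such as find_all, by contrast, look through all descendants by default.

Here the children of the table tag are its tr rows, so each print outputs one row, i.e. everything from <tr> to </tr> (whitespace between rows, if present in the HTML, also shows up as separate text children).

With .descendants, every tag under table is printed as its own item: not only each <tr>...</tr> row but also every tag nested inside the rows, one by one, which makes the output highly redundant.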
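The contrast between children and descendants is easy to see by counting them. The sketch below uses a small hand-written table (written without whitespace between tags so that no text nodes appear between rows) in place of the real page3.html.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; the id mirrors the one used in the article
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr><td>Vegetable Basket</td><td>$15.00</td></tr>'
        '</table>')
bs = BeautifulSoup(html, "html.parser")
table = bs.find("table", {"id": "giftList"})

children = list(table.children)        # only the two <tr> rows
descendants = list(table.descendants)  # rows, cells and the text inside cells

print(len(children), len(descendants))
```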

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
for sibling in bs.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)

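Here next_siblings yields every sibling that follows the first tr, so the first row itself is skipped.

There is also previous_siblings, which does the opposite and yields the siblings that come before a tag.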
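Both directions can be sketched on a small inline table. The rows below are invented stand-ins for the real page; note that previous_siblings walks backwards, nearest sibling first.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table without whitespace between tags
html = ('<table id="giftList">'
        '<tr><th>Item</th></tr>'
        '<tr><td>Vegetable Basket</td></tr>'
        '<tr><td>Russian Nesting Dolls</td></tr>'
        '</table>')
bs = BeautifulSoup(html, "html.parser")
table = bs.find("table", {"id": "giftList"})

# Everything after the first row
after_first = [row.get_text() for row in table.tr.next_siblings]

# Everything before the last row, closest sibling first
last_row = table.find_all("tr")[-1]
before_last = [row.get_text() for row in last_row.previous_siblings]

print(after_first)
print(before_last)
```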

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())

parent moves up to the parent tag, and previous_sibling then moves to the sibling immediately before that parent.

Here that is the text of the price column in the table.
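The same navigation chain can be followed step by step on a single hand-written row. The row contents are assumptions made for illustration, only the src value is taken from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name cell, price cell, image cell
html = ('<table><tr>'
        '<td>Vegetable Basket</td>'
        '<td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"></td>'
        '</tr></table>')
bs = BeautifulSoup(html, "html.parser")

img = bs.find("img", {"src": "../img/gifts/img1.jpg"})
image_cell = img.parent                      # the <td> that wraps the image
price_cell = image_cell.previous_sibling     # the <td> just before it
price = price_cell.get_text()

print(price)
```

If the HTML had whitespace between the cells, previous_sibling could return a text node instead of a td, so real pages may need an extra step.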

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# Escape the literal dots so that "." cannot match an arbitrary character
imgs = bs.find_all("img", {"src": re.compile(r'\.\./img/gifts/img.*\.jpg')})
for img in imgs:
    print(img)         # the whole <img> tag
    print(img["src"])  # only the value of its src attribute

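Here the re module is imported so that a regular expression can be used for matching. .* is a greedy match: it first consumes as much as possible, up to the end of the src value, and then backtracks until the rest of the pattern can match. The loop prints each matching img tag followed by its src link.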
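One detail worth checking when passing a regular expression to find_all is whether literal dots are escaped, since an unescaped . matches any character. The sketch below compares two patterns on made-up src values; the names loose and strict are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical image tags; the last src only resembles a gift image path
html = ('<img src="../img/gifts/img1.jpg">'
        '<img src="../img/gifts/img2.jpg">'
        '<img src="../img/gifts/logo.jpg">'
        '<img src="ab/img/gifts/img3_jpg">')
bs = BeautifulSoup(html, "html.parser")

loose = re.compile("../img/gifts/img.*.jpg")         # dots match anything
strict = re.compile(r"\.\./img/gifts/img.*\.jpg")    # dots match only "."

loose_hits = [img["src"] for img in bs.find_all("img", {"src": loose})]
strict_hits = [img["src"] for img in bs.find_all("img", {"src": strict})]

print(loose_hits)
print(strict_hits)
```

The loose pattern also accepts the last src, because its unescaped dots match "ab" and "_", while the strict pattern keeps only the two real gift images.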

[Table: commonly used regular expression symbols]

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# Keep only the tags that carry exactly two attributes
tags = bs.find_all(lambda tag: len(tag.attrs) == 2)
print(tags)

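A lambda expression can be used to filter tags: find_all calls it with each tag and keeps those for which it returns True. The one above selects tags that have exactly two attributes.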
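A function filter is not limited to counting attributes. As a small sketch on invented markup, the same mechanism can select tags by attribute count or by their text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: div and p have two attributes, span has one
html = ('<div class="a" id="b">'
        '<span class="c">x</span>'
        '<p id="p1" title="t">y</p>'
        '</div>')
bs = BeautifulSoup(html, "html.parser")

# Tags with exactly two attributes
two_attr_names = [tag.name for tag in bs.find_all(lambda tag: len(tag.attrs) == 2)]

# Tags whose complete text is "x"
x_names = [tag.name for tag in bs.find_all(lambda tag: tag.get_text() == "x")]

print(two_attr_names)
print(x_names)
```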
