Parsing Complex HTML Pages


weixin_33916256


Tags: python, web scraping

Original link: http://www.cnblogs.com/no-bald/p/8325663.html


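1. Cascading Style Sheets (CSS) give HTML elements distinct styling, so a web scraper can easily tell tags apart by the value of their class attribute.

The findAll function finds tags by their name and attributes (find_all is the current spelling of the same function in BeautifulSoup 4).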

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
# bs.findAll(tag_name, tag_attributes)
namelist = bs.findAll("span", {"class": "green"})
for name in namelist:
    # get_text() strips every tag and keeps only the text content
    print(name.get_text())
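
To try the same lookup without a network connection, here is a minimal sketch that runs against an inline string. The markup in SAMPLE_HTML and the helper name green_names are made up for illustration; only find_all and get_text come from BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: names in green spans, other text in red spans.
SAMPLE_HTML = """
<p>
  <span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
  <span class="green">Anna Pavlovna</span> was speaking to
  <span class="green">Prince Vasili</span>.
</p>
"""


def green_names(markup):
    """Return the text of every <span class="green"> found in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    spans = soup.find_all("span", {"class": "green"})
    return [span.get_text() for span in spans]


names = green_names(SAMPLE_HTML)
print(names)
```

The red span is skipped because its class value does not match, which is the whole point of filtering on the class attribute.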

The following two lines are equivalent:

bs.findAll(id="text")
bs.findAll("", {"id": "text"})
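
A small sketch comparing the keyword form with the dictionary form on made-up markup. It passes the dictionary through the attrs parameter instead of an empty tag name, because how an empty-string name is handled may differ between BeautifulSoup versions (an assumption worth checking against the version you have installed). Since class is a reserved word in Python, the keyword form for it is class_.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one id="text" element and one green span.
SAMPLE_HTML = (
    '<div id="text"><span class="green">Anna</span></div>'
    '<div id="other"><span class="red">Pierre</span></div>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword argument versus attribute dictionary: both select the same tag.
by_keyword = soup.find_all(id="text")
by_dict = soup.find_all(attrs={"id": "text"})

# "class" is reserved in Python, so the keyword is spelled class_.
green_by_keyword = soup.find_all(class_="green")
green_by_dict = soup.find_all(attrs={"class": "green"})

print(by_keyword == by_dict, green_by_keyword == green_by_dict)
```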

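2. Navigating the tree lets you find tags by their position in the document.

In BeautifulSoup, a child is exactly one level below its parent, while descendants are the tags at every level below the parent. The library's search functions look through descendants by default; if you only want the direct children, use the .children attribute: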

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html, "html.parser")

# .children yields only the nodes directly under the table
for child in bs.find("table", {"id": "giftList"}).children:
    print(child)
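
The difference between children and descendants is easier to see on a tiny table. This is a sketch on made-up markup (the rows are hypothetical, not copied from page3.html); the isinstance check keeps only tags, because both attributes also yield text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table, written without whitespace between tags.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "</table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children only: the two rows.
child_names = [node.name for node in table.children if isinstance(node, Tag)]

# Every nested tag, in document order: rows and their cells.
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```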

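When working with sibling tags, next_siblings collects every product row in the table except the first one, the header row. A tag is never its own sibling, so starting from the first row yields only the rows that follow it.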

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html, "html.parser")

# .tr is the first row (the header); next_siblings walks everything after it
for sibling in bs.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
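
One thing to watch for: next_siblings also yields the whitespace between rows as text nodes. A minimal sketch, using made-up rows, that keeps only the tag siblings and reads the first cell of each:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table; the newlines between rows become text-node siblings.
SAMPLE_HTML = """
<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# Skip text nodes so that only the <tr> tags after the header remain.
data_rows = [node for node in header_row.next_siblings if isinstance(node, Tag)]
items = [row.td.get_text() for row in data_rows]

print(items)
```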

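3. Occasionally you need to go up to a tag's parent, using .parent and .parents. The example below finds an image, moves up to the cell that contains it, steps to the previous sibling cell, and prints that cell's text: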

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html, "html.parser")

# img -> parent <td> -> previous sibling <td> -> its text
print(bs.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
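
The same walk, sketched step by step on a made-up row so each hop is visible. The row content is hypothetical and written without whitespace between cells, so that previous_sibling lands on a tag and not on a text node.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name, price, then the image cell.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><td>Vegetable Basket</td><td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td></tr>'
    "</table>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

image = soup.find("img", {"src": "../img/gifts/img1.jpg"})
image_cell = image.parent                      # the <td> wrapping the image
previous_cell = image_cell.previous_sibling    # the <td> just before it
price = previous_cell.get_text()

# .parents walks all the way up: cell, row, table, then the document itself.
ancestor_names = [ancestor.name for ancestor in image.parents]

print(price)
print(ancestor_names)
```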

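4. To read a tag's attributes directly (for example, the URL an <a> tag points to is stored in its href attribute, and the image file of an <img> tag is stored in its src attribute), use attrs. It returns all of the tag's attributes as a dictionary, which you can index to get a single value:

maTag.attrs  # dictionary of every attribute on the tag
maImgTag.attrs["src"]  # value of the src attribute only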
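
A short sketch of attribute access on made-up markup. The URL and file names are placeholders; tag["src"] is shorthand for tag.attrs["src"], and tag.get returns None for a missing attribute where indexing would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical link wrapping an image.
SAMPLE_HTML = (
    '<a href="http://example.com/gifts">'
    '<img src="../img/gifts/img1.jpg" alt="basket">'
    "</a>"
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.find("a")
image = soup.find("img")

all_image_attrs = image.attrs        # every attribute as a dictionary
href = link.attrs["href"]            # one attribute through the dictionary
src = image["src"]                   # shorthand for image.attrs["src"]
missing_title = image.get("title")   # None, because no title attribute exists

print(all_image_attrs, href, src, missing_title)
```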

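5. Regular expressions. In the example below, the product images are found directly by matching their file paths against a pattern.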

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re  # regular expressions

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Match <img> tags whose src looks like ../img/gifts/img<anything>.jpg
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
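
To see what the pattern accepts and rejects, here is a sketch against a handful of made-up image paths. GIFT_IMAGE is a name introduced here for illustration; the pattern itself is the one used above, written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical images: only the two img*.jpg files under gifts/ should match.
SAMPLE_HTML = (
    '<img src="../img/gifts/logo.jpg">'
    '<img src="../img/gifts/img1.jpg">'
    '<img src="../img/gifts/img2.jpg">'
    '<img src="../img/banner.png">'
)
GIFT_IMAGE = re.compile(r"\.\./img/gifts/img.*\.jpg")

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matching_images = soup.find_all("img", {"src": GIFT_IMAGE})
sources = [image["src"] for image in matching_images]

print(sources)
```

Passing a compiled pattern as an attribute value makes BeautifulSoup keep the tags whose attribute matches it, so the logo and the banner are filtered out without any extra code.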

Reposted from: https://www.cnblogs.com/no-bald/p/8325663.html
