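Today I learned how to use BeautifulSoup functions to fetch specific nodes, and how to start from the current node and work outward to its children, descendants, siblings, and parent.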
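Exercise 1: use findAll to extract only the text inside <span class="green"></span> tags
While I was at it, I also extracted the content of the class='red' tags.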
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
# r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
# bsObj = BeautifulSoup(r, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
# persons = bsObj.findAll('span', {'class': 'green'})
# conversations = bsObj.findAll('span', {'class': 'red'})
# for name in persons:
#     print(name.get_text())
# print('\n')
# for talk in conversations:
#     print(talk.get_text())
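To try the class filter from Exercise 1 without a network connection, here is a minimal sketch that runs against an inline snippet. The HTML string is a made-up stand-in for the real page, and find_all is the newer spelling of findAll in bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for warandpeace.html (not the real page content).
HTML = """
<div>
<span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Prince Vasili</span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Passing a dict of attributes and using the class_ keyword are equivalent filters.
names = [s.get_text() for s in soup.find_all("span", {"class": "green"})]
quotes = [s.get_text() for s in soup.find_all("span", class_="red")]

print(names)
print(quotes)
```

The keyword is written class_ because class is a reserved word in Python.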
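Exercise 2: find HTML elements by matching their content
Finding HTML elements is something I already practiced yesterday: it is done with the find/findAll functions.
With the tag and attributes arguments of these two functions we can retrieve most tags. We can also use the text argument (as in the example below) to match on text content instead of on attributes. Note that a plain string matches only text nodes whose content is exactly that string, and that the results are the text nodes themselves rather than the tags around them.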
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
# r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
# bsObj = BeautifulSoup(r, 'html.parser')
# test = bsObj.findAll(text='the prince')  # newer bs4 versions call this argument string
# print(len(test))  # number of text nodes that are exactly 'the prince'
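The difference between an exact match and a "contains" match is easy to miss, so here is a small sketch on an invented snippet. It uses string=, the name bs4 has used for the text argument since version 4.4.0, and a compiled regular expression for the partial match. The HTML is an assumption made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: two spans whose whole text is 'the prince',
# plus one sentence that merely contains it.
HTML = """
<p><span class="green">the prince</span> spoke first.</p>
<p>Everyone watched <span class="green">the prince</span> leave.</p>
<p>Nobody greeted the prince that day.</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A plain string matches only text nodes that are exactly equal to it.
exact = soup.find_all(string="the prince")

# A compiled regex matches any text node the pattern can be found in.
partial = soup.find_all(string=re.compile("the prince"))

# The results are text nodes; .parent gives the enclosing tag.
exact_parents = [node.parent.name for node in exact]

print(len(exact), len(partial), exact_parents)
```

Here the exact match finds the two span texts, while the regex also picks up the third sentence.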
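Exercise 3: child tags and descendant tags (mind the difference)
A child tag sits exactly one level below its parent tag, whereas descendant tags are the tags at every level below a parent. All children are descendants, but not all descendants are children.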
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
# r = urlopen('http://www.pythonscraping.com/pages/page3.html')
# bsObj = BeautifulSoup(r, 'html.parser')
# for child in bsObj.find('table', {'id': 'giftList'}).children:  # direct children only
#     print(child)
# print('\n')
# for descendant in bsObj.find('table', {'id': 'giftList'}).descendants:  # every nested level
#     print(descendant)
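To make the child/descendant distinction concrete, the following sketch lists tag names at both levels for a tiny table. The HTML is a hypothetical, whitespace-free stand-in for the giftList table, and text nodes are filtered out so that only tags are compared.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table; written without whitespace so no blank text nodes appear.
HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "</table>"
)

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields only the nodes one level below the table.
child_tags = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants walks every level below the table, in document order.
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

Both iterators also yield text nodes (NavigableString objects), which is why the isinstance check is there.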
Exercise 4: get sibling nodes with next_siblings
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
# r = urlopen('http://www.pythonscraping.com/pages/page3.html')
# bsObj = BeautifulSoup(r, 'html.parser')
# for sibling in bsObj.find('table', {'id': 'giftList'}).tr.next_siblings:  # rows after the first one
#     print(sibling)
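One detail worth seeing in isolation is that next_siblings does not include the node you start from. A minimal sketch, assuming an invented three-row table whose id values (header, gift1, gift2) are hypothetical:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table; the row ids are made up so the rows are easy to tell apart.
HTML = (
    '<table id="giftList">'
    '<tr id="header"><th>Item</th></tr>'
    '<tr id="gift1"><td>A</td></tr>'
    '<tr id="gift2"><td>B</td></tr>'
    "</table>"
)

soup = BeautifulSoup(HTML, "html.parser")
first_row = soup.find("table", {"id": "giftList"}).tr

# next_siblings yields the nodes after first_row, never first_row itself.
following = [row["id"] for row in first_row.next_siblings if isinstance(row, Tag)]

# previous_siblings walks backwards, starting with the nearest node.
last_row = soup.find("tr", {"id": "gift2"})
preceding = [row["id"] for row in last_row.previous_siblings if isinstance(row, Tag)]

print(following)
print(preceding)
```

On a real page the siblings usually include whitespace text nodes between the rows, hence the Tag filter.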
Exercise 5: work with parent nodes using parent/parents
# from urllib.request import urlopen
# from bs4 import BeautifulSoup
# r = urlopen('http://www.pythonscraping.com/pages/page3.html')
# bsObj = BeautifulSoup(r, 'html.parser')
# # img -> parent <td> -> previous sibling <td>, which holds the price
# money = bsObj.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling
# print(money.get_text())
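The chain in Exercise 5 can be followed step by step on a small example. This is only a sketch: the row below is a hypothetical stand-in for one row of the gift table, and find_previous_sibling is shown as an alternative that skips any whitespace text nodes a real page may have between cells.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table standing in for one row of page3.html.
HTML = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)

soup = BeautifulSoup(HTML, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

# Step 1: the tag that directly contains the image.
cell = img.parent

# Step 2: the node just before that cell, here the price cell.
price = cell.previous_sibling.get_text()

# Alternative: ask for the previous <td> explicitly, ignoring text nodes in between.
price_via_find = cell.find_previous_sibling("td").get_text()

# .parents walks all the way up the tree; keep the three nearest ancestors.
nearest_ancestors = [p.name for p in img.parents][:3]

print(cell.name, price, price_via_find, nearest_ancestors)
```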