Python Web Scraping 2: Advanced HTML Parsing

CopperDong

Article link: https://blog.csdn.net/QFire/article/details/78903117

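2.1 You Don't Always Need a Hammer

If a scraper depends directly on the details of a page's HTML tags, a small change to the site by its administrator can break it. What can you do instead?

Look for a "print this page" link, or use the mobile version of the site.
Look for information hidden in JavaScript files.
Page titles are a common scraping target, but the same information can sometimes be taken from the page's URL.
Look for other data sources.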
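
To illustrate the URL idea from the list above, here is a minimal sketch that guesses a page title from the last path segment of a URL. The helper name title_from_url and the example URL are hypothetical, and the sketch assumes the site uses hyphen-separated or underscore-separated slugs, which is common but not universal.

```python
from urllib.parse import urlparse, unquote

def title_from_url(url):
    """Guess a page title from the last path segment of a URL (hypothetical helper)."""
    path = urlparse(url).path
    # Keep only non-empty path segments, e.g. ["books", "war-and-peace.html"]
    segments = [s for s in path.split("/") if s]
    if not segments:
        return ""
    slug = unquote(segments[-1])
    # Drop a file extension such as ".html" if there is one
    if "." in slug:
        slug = slug.rsplit(".", 1)[0]
    # Assumption: words in the slug are separated by hyphens or underscores
    return slug.replace("-", " ").replace("_", " ").title()

# Made-up URL, for illustration only
print(title_from_url("http://example.com/books/war-and-peace.html"))
```

This only works when the URL actually encodes the title, so treat it as a fallback to check against the page rather than a replacement for parsing.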

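2.2 Another Serving of BeautifulSoup

A scraper can select elements by their CSS class attribute, for example every piece of text styled in a particular color. The example below collects all span tags with the class green: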

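from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
except HTTPError as e:
    # The server answered with an error status
    print(e)
else:
    # Name the parser explicitly to avoid the "no parser specified" warning
    bsObj = BeautifulSoup(html, "html.parser")
    # Collect every <span class="green"> tag on the page
    nameList = bsObj.findAll("span", {"class": "green"})
    for name in nameList:
        print(name.get_text())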

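Calling bsObj.findAll(tagName, tagAttributes) returns every matching tag on the page, such as <span class="green"></span>.

name.get_text() strips all tags from the content you are working with and returns a string containing only the text.
find() and findAll() are probably the two most commonly used BeautifulSoup functions. They filter an HTML page by tag attributes and return either a group of matching tags or a single tag.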

findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)
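
The sketch below exercises several of these arguments on a small, made-up HTML string, so it runs without a network connection. It uses find_all, the PEP 8 spelling that bs4 provides for findAll; both accept the same arguments. The markup and variable names are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = """
<div>
  <h1>Title</h1>
  <span class="green">Anna</span>
  <span class="red">Hello</span>
  <span class="green">Pierre</span>
  <h2>Subtitle</h2>
</div>
"""
bsObj = BeautifulSoup(html, "html.parser")

# tag: a single tag name or a list of names
headings = [h.get_text() for h in bsObj.find_all(["h1", "h2"])]
# attributes: a dict of attribute values; a list matches any of its values
colored = [s.get_text() for s in bsObj.find_all("span", {"class": ["green", "red"]})]
# limit: stop after the first n matches
first_two = [s.get_text() for s in bsObj.find_all("span", limit=2)]
# recursive=False: look only at direct children of the object being searched
top_level = bsObj.find_all("span", recursive=False)
# keyword argument: class_ has a trailing underscore because class is reserved
green = [s.get_text() for s in bsObj.find_all("span", class_="green")]
# find() returns the first match only, or None when nothing matches
first_green = bsObj.find("span", {"class": "green"})

print(headings)
print(colored)
print(first_two)
print(top_level)
print(green)
print(first_green.get_text())
```

The recursive=False search comes back empty here because the span tags sit inside the div rather than directly under the document root.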

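Other BeautifulSoup objects: besides the BeautifulSoup object and Tag objects, there are NavigableString objects (the text inside a tag) and Comment objects (used to find HTML comments in the document).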
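
As a rough illustration of these object types, the following sketch walks the children of one paragraph and reports which class each node belongs to. The HTML snippet is made up for the example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up HTML containing text, a comment, and a nested tag
html = "<p>Hello<!-- a hidden note --><b>world</b></p>"
bsObj = BeautifulSoup(html, "html.parser")

kinds = []
for node in bsObj.p.children:
    # Check Comment first: Comment is a subclass of NavigableString
    if isinstance(node, Comment):
        kinds.append("Comment")
    elif isinstance(node, NavigableString):
        kinds.append("NavigableString")
    elif isinstance(node, Tag):
        kinds.append("Tag")

print(type(bsObj).__name__)
print(kinds)
```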

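Navigating the tree: find a tag by its position in the document, for example bsObj.tag.subTag.anotherSubTag

The examples below use this page: http://www.pythonscraping.com/pages/page3.html

(1) Children and other descendants: bsObj.div.findAll("img") finds the first div tag in the document, then returns a list of all img tags among that div's descendants.

To get only the direct children, use the .children attribute: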

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# .children yields only the direct children of the table, not deeper descendants
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)
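
To make the difference between children and descendants concrete, here is a small offline sketch. The table markup is a made-up stand-in for the example page, not a copy of it.

```python
from bs4 import BeautifulSoup, Tag

# Made-up, compact markup so that no whitespace text nodes appear
html = ("<table id='giftList'>"
        "<tr><th>Item</th></tr>"
        "<tr><td><img src='a.jpg'></td></tr>"
        "</table>")
bsObj = BeautifulSoup(html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

# Direct children only: the two rows
child_names = [c.name for c in table.children if isinstance(c, Tag)]
# Every tag at any depth below the table
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]
# Dot navigation picks the first matching tag at each step
first_header = bsObj.table.tr.th.get_text()

print(child_names)
print(descendant_names)
print(first_header)
```

On real pages, both .children and .descendants also yield the whitespace strings between tags, which is why the sketch filters with isinstance(..., Tag).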

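(2) Siblings: the next_siblings attribute makes collecting table data simple, especially for tables with a header row. A tag is not its own sibling, so starting from the first tr skips the header row: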

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Start at the first row (the header) and walk every sibling after it
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
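
The same idea can be tried offline. The sketch below uses a made-up table (the rows and prices are illustrative, not taken from the page) and turns the data rows into lists of cell text while skipping the header row.

```python
from bs4 import BeautifulSoup, Tag

# Made-up table with a header row, written with normal line breaks
html = """
<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>
</table>
"""
bsObj = BeautifulSoup(html, "html.parser")
header_row = bsObj.find("table", {"id": "giftList"}).tr

rows = []
for sibling in header_row.next_siblings:
    # next_siblings also yields the newline strings between rows; skip them
    if isinstance(sibling, Tag):
        rows.append([cell.get_text() for cell in sibling.find_all("td")])

print(rows)
```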

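(3) Parents: .parent and .parents. The line below reuses bsObj from the previous example. It finds the image, moves up to its parent tag, steps to that tag's previous sibling, and prints the sibling's text: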

print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
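
Here is a self-contained sketch of the same navigation on a made-up table row, so each step can be inspected on its own. The markup and cell values are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up row, written without whitespace between the cells
html = ("<table><tr>"
        "<td>Vegetable Basket</td>"
        "<td>$15.00</td>"
        "<td><img src='../img/gifts/img1.jpg'></td>"
        "</tr></table>")
bsObj = BeautifulSoup(html, "html.parser")

img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
cell = img.parent                      # the <td> that wraps the image
price_cell = cell.previous_sibling     # the <td> immediately before it
# .parents walks all the way up: td, tr, table, then the document itself
ancestors = [p.name for p in img.parents]
# find_previous_sibling("td") skips whitespace strings between cells
same_cell = cell.find_previous_sibling("td")

print(price_cell.get_text())
print(ancestors[:3])
```

When the source HTML has line breaks between cells, previous_sibling may return a whitespace string instead of a tag; find_previous_sibling("td") avoids that.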

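2.3 Regular Expressions

Regular expressions (regex) are often mocked as a meaningless jumble of random symbols. That impression scares people away, and they end up writing elaborate, unnecessary search and filter functions when all they really needed was a one-line regular expression.

https://www.regexpal.com/ lets you test regular expressions online.

Matching an email address: [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)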
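
The pattern above can be tried directly with Python's re module. The sample addresses below are made up, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# The email pattern from the text, written as a raw string
EMAIL_RE = re.compile(r"[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)")

# Made-up sample strings
samples = [
    "alice@example.com",         # plain address
    "bob.smith+tag@school.edu",  # dots and plus signs are allowed before the @
    "no-at-sign.com",            # no @ at all
    "carol@site.io",             # top-level domain not in the list
    "dave@mail.example.com",     # subdomains are not covered by the pattern
]

results = {s: bool(EMAIL_RE.fullmatch(s)) for s in samples}
for address, ok in results.items():
    print(address, ok)
```

The last two samples show how deliberately narrow this pattern is: it accepts only four top-level domains and a single letters-only domain label.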

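2.4 Regular Expressions and BeautifulSoup

On the example page, the product image paths all start with ../img/gifts/img and end with .jpg, so a regular expression can select exactly those img tags: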

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
# Raw string so the backslashes reach the regex engine unchanged
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])

Output:
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
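
To see why the pattern keeps only the product images, it can be checked on its own against a few paths. The sample paths below are made up for illustration. The sketch uses search() because that is how BeautifulSoup applies a compiled pattern to an attribute value.

```python
import re

# Same pattern as in the scraper above
IMG_RE = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Made-up src values of the kind a page might contain
paths = [
    "../img/gifts/logo.jpg",   # right folder, but the name does not start with "img"
    "../img/gifts/img1.jpg",
    "../img/gifts/img6.jpg",
    "../img/gifts/img3.png",   # wrong extension
]

matched = [p for p in paths if IMG_RE.search(p)]
print(matched)
```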