Chapter 2: Advanced HTML Parsing

badapplecn

Category: Python网络数据采集笔记 (Python web scraping notes). Tags: data collection, web scraping, python

Article link: https://blog.csdn.net/badapplecn/article/details/72963752

2.1 You Don't Always Need a Hammer

 
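Strategies for scraping data that is buried deep in a page:

1. Look for a "print this page" link, or check whether the site has a mobile version;
2. Look for information hidden in JavaScript files;
3. Extract information from the page's URL;
4. Look for other data sources, such as other websites.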

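2.2 Another Serving of BeautifulSoup

This section is mainly about why CSS is a blessing for scrapers: attributes such as class give otherwise identical tags distinguishing marks that a scraper can select on.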

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html,"html.parser")

nameList = bsObj.find_all("span", {"class": "green"})  # the book uses findAll; find_all is the current name
for name in nameList:
    print(name.get_text())

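When should you use get_text(), and when should you keep the tags?
.get_text() strips every tag from the HTML you are working with and returns a string containing only the text. If you are working with a large block of source that contains many hyperlinks, paragraphs, and other tags, .get_text() removes all of them and leaves a single tagless string. Because it is much easier to search a tag tree than a block of plain text, it is usually best to keep the tags as long as possible and call .get_text() last, just before printing or storing the final data.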
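To make the trade-off concrete, here is a minimal sketch. The HTML fragment is made up for illustration and is parsed from a string so the example runs without network access. It shows that an attribute such as href is reachable while you still hold a Tag, and gone once you have called get_text().

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, used only for illustration.
html = '<p>Visit <a href="/about">our <b>about</b> page</a> today.</p>'
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")
link = paragraph.find("a")    # still a Tag, so its attributes are available
href = link["href"]
text = paragraph.get_text()   # plain string: tags and attributes are gone

print(href)
print(text)
```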

2.2.1 find() and findAll() with BeautifulSoup

findAll(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)

① tag can be a single tag name or a Python list of tag names.
E.g.: .findAll({"h1","h2","h3","h4","h5","h6"})  # the braces here make a set; a list written with "[]" works too

 
② attributes takes a Python dictionary that maps a tag's attributes to the values they should have.

E.g.: .findAll("span", {"class":{"green", "red"}})
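The first two arguments can be tried out on a small inline document. This is only a sketch: the HTML is invented, and a list is used for the class values (rather than the set in the example above) because lists are the documented form for "match any of these".

```python
from bs4 import BeautifulSoup

# Invented HTML, loosely modelled on the colored spans of the book's example page.
html = """
<h1>Title</h1>
<h2>Subtitle</h2>
<p><span class="green">Anna</span> met <span class="red">the prince</span>
and <span class="blue">nobody else</span>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# tag argument: a list of tag names matches any of them
headings = soup.find_all(["h1", "h2", "h3"])

# attributes argument: a dictionary of attribute -> accepted value(s)
colored = soup.find_all("span", {"class": ["green", "red"]})

heading_names = [h.name for h in headings]
colored_texts = [s.get_text() for s in colored]
print(heading_names)
print(colored_texts)
```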

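③ The recursive argument is a boolean that controls how deep into the HTML tag structure the search goes. If recursive is True, findAll searches the children of the tag, and the children's children, for matches. If it is False, only the direct children are examined. The default is True.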
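A small sketch of the difference, using an invented fragment in which one paragraph is a direct child of the div and another is nested one level deeper:

```python
from bs4 import BeautifulSoup

# Invented fragment: one <p> directly under the div, one nested in a <section>.
html = '<div id="outer"><p>direct</p><section><p>nested</p></section></div>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", {"id": "outer"})

all_paragraphs = outer.find_all("p")                       # recursive=True is the default
direct_paragraphs = outer.find_all("p", recursive=False)   # direct children only

all_texts = [p.get_text() for p in all_paragraphs]
direct_texts = [p.get_text() for p in direct_paragraphs]
print(all_texts)
print(direct_texts)
```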

④ The text argument matches on the text content of tags rather than on their attributes.

E.g.: nameList = bsObj.findAll(text="the prince")
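A sketch of text matching on an invented fragment. One assumption to note: the code uses the keyword string, which newer BeautifulSoup 4 releases (4.4.0 and later) use for the argument the book calls text. A plain string must equal the whole text node, while a compiled regular expression also matches text nodes that merely contain the phrase.

```python
import re
from bs4 import BeautifulSoup

# Invented fragment with the phrase both alone in a tag and inside a longer sentence.
html = ('<p><span class="green">the prince</span> spoke; later the prince left. '
        '<span>the prince</span></p>')
soup = BeautifulSoup(html, "html.parser")

# Exact match: only text nodes that are exactly "the prince".
exact = soup.find_all(string="the prince")

# Regex match: any text node containing the phrase.
partial = soup.find_all(string=re.compile("the prince"))

print(len(exact), len(partial))
```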

⑤ limit applies only to findAll; find is equivalent to findAll with limit set to 1. Set it when you are only interested in the first x results from the page.
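A quick sketch of limit on an invented list, also checking that find returns the same element as the first result of a limited find_all:

```python
from bs4 import BeautifulSoup

# Invented list used only for illustration.
html = "<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)   # stop after two matches
first = soup.find("li")                    # same as find_all("li", limit=1)[0]

first_two_texts = [li.get_text() for li in first_two]
print(first_two_texts)
print(first.get_text())
```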

⑥ keyword arguments select tags that have a particular attribute value.

E.g.: allText = bsObj.findAll(id="text")
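The keyword form is a shorthand for the attributes dictionary, which the following sketch checks on an invented fragment. Because class is a reserved word in Python, BeautifulSoup accepts class_ as the keyword for that attribute.

```python
from bs4 import BeautifulSoup

# Invented fragment used only for illustration.
html = '<div id="text" class="body">main text</div><div id="other">sidebar</div>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(id="text")             # keyword argument
by_attrs = soup.find_all(attrs={"id": "text"})    # equivalent dictionary form
by_class = soup.find_all(class_="body")           # class_ because class is reserved

print(by_keyword[0].get_text())
```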

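2.2.2 Other BeautifulSoup Objects

The library has four kinds of objects: the BeautifulSoup object, Tag objects, NavigableString objects (which represent the text inside a tag), and Comment objects (used to find HTML comments, such as <!-- like this one -->).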
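The four object types can be observed directly on a tiny invented document. This is a sketch rather than a complete tour; the lambda passed to find_all is one common way to collect comments, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Invented document with one text node and one HTML comment.
html = "<p>Hello<!-- a hidden note --></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# The children of <p> are a NavigableString and a Comment.
kinds = [type(node).__name__ for node in p.children]
print(kinds)

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
```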

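2.2.3 Navigating Trees

1. Dealing with children and other descendants.

children yields only a tag's direct children; descendants yields everything nested at any depth below it.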

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)
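To contrast children with descendants without fetching the page, here is a sketch on an invented table (the real giftList table is larger). Text nodes are filtered out so that only tags are compared.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Invented, whitespace-free stand-in for the giftList table.
html = ('<table id="giftList"><tr><th>Item</th></tr>'
        '<tr><td>Vegetable <b>basket</b></td></tr></table>')
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children only: the two rows.
child_names = [node.name for node in table.children if isinstance(node, Tag)]

# Every tag at any depth, in document order.
descendant_names = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```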

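2. Dealing with siblings.

The next_siblings attribute is well suited to tables with a title row: starting from the first row, it yields every following row but not the title row itself, because a tag is never its own sibling.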

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)

Like next_siblings, previous_siblings is useful when the tag that is easy to select is the last one in a group of siblings.
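Both directions can be sketched on an invented three-row table. Note that previous_siblings walks backwards, so the row nearest to the starting tag comes first.

```python
from bs4 import BeautifulSoup

# Invented table: one title row followed by two data rows, no whitespace between tags.
html = ('<table id="giftList"><tr><th>Item</th></tr>'
        '<tr><td>A</td></tr><tr><td>B</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

title_row = table.tr                      # first <tr> in the table
after_title = [row.get_text() for row in title_row.next_siblings]

last_row = table.find_all("tr")[-1]
before_last = [row.get_text() for row in last_row.previous_siblings]

print(after_title)
print(before_last)
```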

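3. Dealing with parents.

parent returns the tag that directly contains the current one; parents iterates over all of its ancestors, working outward.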
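A sketch of both attributes on an invented table row, in which an image sits in the cell after a price. Going up to the image's parent cell and then one sibling back reaches the price.

```python
from bs4 import BeautifulSoup

# Invented row: a price cell followed by an image cell.
html = ('<table><tr><td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"></td></tr></table>')
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

# parent: the <td> that directly contains the image.
price = img.parent.previous_sibling.get_text()

# parents: every ancestor, nearest first.
ancestor_names = [tag.name for tag in img.parents]

print(price)
print(ancestor_names)
```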

2.3 Regular Expressions

Online regular expression testers:

http://www.regexpal.com/

http://tool.oschina.net/regex

2.4 Regular Expressions and BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = 'http://www.pythonscraping.com/pages/page3.html'
html = urlopen(url)

bsObj = BeautifulSoup(html,'html.parser')

# Raw string avoids invalid-escape warnings; "/" needs no escaping in Python regexes.
images = bsObj.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})

for image in images:
    print(image['src'])
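The same idea can be checked offline. In this sketch the img tags are invented, and they include two that should not match: one outside the gifts directory and one that is not a .jpg file.

```python
import re
from bs4 import BeautifulSoup

# Invented tags: only the two gift images should match the pattern.
html = ('<img src="../img/logo.jpg">'
        '<img src="../img/gifts/img1.jpg">'
        '<img src="../img/gifts/img2.jpg">'
        '<img src="../img/gifts/readme.txt">')
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
images = soup.find_all("img", {"src": pattern})

sources = [image["src"] for image in images]
print(sources)
```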