[Python Web Scraping] Reading Notes on Web Scraping with Python (Chapter 2)

Latest recommended article published on 2023-02-28 20:33:32

Tag_sk

Category: Python Scraping (Python爬虫). Tags: python, html, reading notes, web scraping

Article link: https://blog.csdn.net/github_35746658/article/details/53886972

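Web Scraping with Python

Chapter 2: Advanced HTML Parsing

demo1

This demo uses BS4 (BeautifulSoup 4) to select elements by their CSS class and print the names of the novel's characters in the order they appear. Open the URL in a browser and the author's intent should become clear: on the sample page the character names are wrapped in span tags with the class green.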

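from urllib.request import urlopen
from bs4 import BeautifulSoup

# Sample page set up by the book's author
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
bsobj = BeautifulSoup(html, 'html.parser')

# find_all is the current bs4 name for findAll
namelist = bsobj.find_all('span', {'class': 'green'})
for name in namelist:
    # get_text() is a bs4 method that strips all tags and returns only the text
    print(name.get_text())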
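
To try the same idea without a network connection, here is a minimal sketch that runs the class lookup against an inline snippet. The markup and the names in it are made up for illustration; they only imitate the green-span structure of the sample page.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the sample page: names sit in green spans,
# spoken lines in red spans. This is not the real page content.
SAMPLE_HTML = """
<div id="text">
<span class="red">Good evening.</span>
The speaker was <span class="green">Anna <b>Pavlovna</b> Scherer</span>,
maid of honor to <span class="green">Empress Marya Fedorovna</span>.
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns Tag objects; get_text() drops nested markup such as <b>
# and keeps only the text.
names = [span.get_text() for span in soup.find_all("span", {"class": "green"})]
print(names)
```

Note that get_text() is best called last, right before printing or storing: once the tags are stripped, there is no structure left to navigate.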

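demo2

The find() and findAll() functions (findAll is the older spelling; current bs4 calls it find_all and keeps findAll as an alias)

findAll(tag, attributes, recursive, text, limit, keywords)

find(tag, attributes, recursive, text, keywords)

# tag: a tag name, or a Python list of tag names, for example
bsobj.find_all(['h1', 'h2', 'h3'])

# attributes: a Python dictionary mapping attribute names to the values to match, for example
bsobj.find_all('span', {'class': ['green', 'red']})

# recursive: a boolean, True by default. When True, the search descends through all descendants of the tag; when False, only its direct children are checked.

# text: matches on the text content of tags rather than on their attributes (newer bs4 releases call this argument string).

# limit: applies to findAll only and stops after the first N matches in document order. find() behaves like findAll() with limit=1.

# keywords: selects tags that carry a given attribute value, for example id='text'. Because class is a reserved word in Python, the keyword form is class_.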
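
The parameters above can be sketched against a small inline document. This is an illustration under assumptions, not the book's example: the markup is made up, and the code uses the current names find_all and string rather than findAll and text.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; not the book's sample page.
SAMPLE_HTML = """
<html><body>
<h1>Title</h1>
<div id="text">
<h2>Section</h2>
<span class="green">the prince</span>
<span class="red">Hello</span>
<span class="green">the prince</span>
</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# tag: a list of names matches any of them, in document order.
headings = [tag.name for tag in soup.find_all(["h1", "h2", "h3"])]

# attributes: a list of values matches either class.
spans = soup.find_all("span", {"class": ["green", "red"]})

# recursive=False looks only at direct children of <body>,
# and the spans sit one level deeper, inside the <div>.
direct_spans = soup.body.find_all("span", recursive=False)

# string (called text in older bs4 releases): match on text content.
princes = soup.find_all(string="the prince")

# limit: stop after the first two matches.
first_two = soup.find_all("span", limit=2)

# keywords: filter on an attribute; class needs the trailing underscore.
by_id = soup.find_all(id="text")
green = soup.find_all(class_="green")

print(headings, len(spans), len(direct_spans), len(princes),
      len(first_two), by_id[0].name, len(green))
```

The recursive=False call returns an empty list here, which is an easy way to see the difference between direct children and deeper descendants.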

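demo3

The main BeautifulSoup object types
- BeautifulSoup object: the whole parsed document
- Tag object: a single element, returned by find() and findAll() or reached by attribute access such as bsobj.div.h1
- NavigableString object: the text inside a tag
- Comment object: an HTML comment such as <!-- like this -->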
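
A short sketch of how the four object types show up in practice. The one-line document is an assumption made for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up one-line document containing a tag, some text and a comment.
SAMPLE_HTML = "<div><h1>Title</h1><!-- a hidden note --></div>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

heading = soup.div.h1        # Tag, reached by attribute access
text = heading.string        # NavigableString holding "Title"
# Comments are a special kind of string, so search for them by type.
comment = soup.find(string=lambda s: isinstance(s, Comment))

type_names = [type(obj).__name__ for obj in (soup, heading, text, comment)]
print(type_names)
```

Because Comment is a subclass of NavigableString, code that collects page text often needs an explicit isinstance check to leave comments out.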

demo4

Working with child tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsobj = BeautifulSoup(html, 'html.parser')
# .children yields only the direct children of the giftList table
for child in bsobj.find('table', {'id': 'giftList'}).children:
    print(child)
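
The difference between children and descendants is easier to see on a tiny table. A minimal sketch, assuming a made-up table that only borrows the giftList id; the item names and prices are placeholders.

```python
from bs4 import BeautifulSoup, Tag

# Made-up table; only the id is borrowed from the sample page.
SAMPLE_HTML = """
<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr id="gift1"><td>Gift A</td><td>$1.00</td></tr>
<tr id="gift2"><td>Gift B</td><td>$2.00</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children also yields the newline text nodes between rows,
# so keep only Tag objects.
child_names = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants walks every level below the table: rows and their cells.
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```

Here the table has three child tags (the rows) but nine descendant tags (the rows plus their cells).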

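demo5

Working with sibling tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsobj = BeautifulSoup(html, 'html.parser')

# Start from the first <tr> (the header row); next_siblings yields everything after it
for sibling in bsobj.find('table', {'id': 'giftList'}).tr.next_siblings:
    # Prints every product row in the table; the header row itself is excluded
    print(sibling)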
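
An object is never its own sibling, which is why the header row drops out. The sketch below shows this on a made-up table (placeholder rows, not the real product list) and compares next_siblings with find_next_siblings, which filters out text nodes for you.

```python
from bs4 import BeautifulSoup, Tag

# Made-up table; only the id is borrowed from the sample page.
SAMPLE_HTML = """
<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr id="gift1"><td>Gift A</td><td>$1.00</td></tr>
<tr id="gift2"><td>Gift B</td><td>$2.00</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# next_siblings yields text nodes (newlines) as well as tags.
rows = [sib for sib in header_row.next_siblings if isinstance(sib, Tag)]
row_ids = [row["id"] for row in rows]

# find_next_siblings("tr") returns only the matching tags.
same_rows = header_row.find_next_siblings("tr")

print(row_ids)
```

previous_siblings and find_previous_siblings work the same way in the other direction.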

demo6

Working with parent tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsobj = BeautifulSoup(html, 'html.parser')
# img -> parent <td> -> previous <td> (the price cell) -> its text
print(bsobj.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())
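
The chain above relies on the price cell sitting directly before the image cell with no whitespace in between. A slightly more defensive sketch, assuming made-up markup that has line breaks between the cells, uses find_previous_sibling so that whitespace text nodes are skipped.

```python
from bs4 import BeautifulSoup

# Made-up row with whitespace between cells; only the image path
# is borrowed from the demo above.
SAMPLE_HTML = """
<table id="giftList">
<tr id="gift1">
  <td>Gift A</td>
  <td>$15.00</td>
  <td><img src="../img/gifts/img1.jpg"></td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
img = soup.find("img", {"src": "../img/gifts/img1.jpg"})

cell = img.parent                        # the <td> that wraps the image
raw_previous = cell.previous_sibling     # here: a whitespace text node
price_cell = cell.find_previous_sibling("td")  # skips text nodes
price = price_cell.get_text().strip()

print(repr(str(raw_previous)), price)
```

With this markup, previous_sibling alone would land on the whitespace between the cells, while find_previous_sibling("td") reaches the price.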

demo7

Regular expressions
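
As a minimal sketch of how a regular expression works on its own, the block below applies the image-path pattern from demo8 to a few strings using only the standard re module. The candidate strings are made up for illustration.

```python
import re

# Raw string: the backslashes are passed to the regex engine unchanged.
# \.  matches a literal dot, .* matches any run of characters.
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Made-up candidate paths.
candidates = [
    "../img/gifts/img1.jpg",   # matches
    "../img/gifts/logo.jpg",   # file name does not start with "img"
    "../img/gifts/img3.png",   # wrong extension
    "/img/gifts/img2.jpg",     # missing the leading "../"
]

matches = [path for path in candidates if pattern.search(path)]
print(matches)
```

Only the first path satisfies every part of the pattern, so it is the only one printed.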

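demo8

Combining regular expressions with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re  # standard-library regular expression module

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsobj = BeautifulSoup(html, 'html.parser')
# Raw string so the backslashes reach the regex engine unchanged;
# matches product image paths such as ../img/gifts/img1.jpg
images = bsobj.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])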