Web Scraping Basics: the BeautifulSoup4 Parser

What is BeautifulSoup4?

Like lxml, Beautiful Soup is a Python HTML/XML parser that makes it easy to extract data from web pages.

lxml only traverses the parts of the document it needs, whereas Beautiful Soup works on the HTML DOM: it loads the whole document and builds the complete DOM tree, so its time and memory overhead are much larger and its performance is lower than lxml's. On the other hand, Beautiful Soup is simple to use for parsing HTML, its API is very friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's XML parser.
Beautiful Soup 3 is no longer developed; new projects should use Beautiful Soup 4.
Install it with pip: pip install beautifulsoup4
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Parsers supported by Beautiful Soup

Parser                  | Usage                                 | Advantages
Python standard library | BeautifulSoup(markup, 'html.parser')  | built into Python; moderate speed; good tolerance for malformed documents
lxml HTML parser        | BeautifulSoup(markup, 'lxml')         | fast; good tolerance for malformed documents
lxml XML parser         | BeautifulSoup(markup, 'xml')          | fast; the only parser that supports XML
html5lib                | BeautifulSoup(markup, 'html5lib')     | best error tolerance; parses the document the way a browser does and produces valid HTML5
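
The parser is chosen by the second argument of the BeautifulSoup constructor. A minimal sketch, using a made-up HTML snippet, of how the table above translates into code:

from bs4 import BeautifulSoup

html = "<html><body><p class='title'>Hello <b>world</b></p></body></html>"

# 'html.parser' needs no extra installation; 'lxml', 'xml' and 'html5lib'
# require the lxml / html5lib packages to be installed first.
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.text)       # Hello world
print(soup.prettify())   # the document re-indented by the parser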

Methods supported by Beautiful Soup

Method / argument | Purpose
find              | returns the first matching node
find_all          | returns all matching nodes
name              | the tag to match: a tag name, a regular expression, or a list of tag names such as ['a', 'img']
attrs             | a dict mapping attribute names to the values the tag must have
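
A minimal sketch of how these arguments are used, again on a made-up snippet; name and attrs work the same way for both find and find_all:

import re
from bs4 import BeautifulSoup

html = '<div class="nav"><a href="/a">A</a><img src="x.png"/><a href="/b">B</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('a'))                         # only the first <a> node
print(soup.find_all('a'))                     # every <a> node
print(soup.find_all(re.compile('^a')))        # name as a regular expression
print(soup.find_all(['a', 'img']))            # name as a list of tag names
print(soup.find_all(attrs={'class': 'nav'}))  # attrs as a dict of attribute values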

CSS selectors

soup.select() is another lookup method, very similar in spirit to find_all. The syntax is the same as in CSS: tag names are written as-is, class names are prefixed with a dot (.), and id names are prefixed with a hash (#). soup.select() returns a list of matching elements; a short example follows the table below.

Expression                     | Meaning
*                              | selects all nodes
#container                     | selects the node whose id is container
.container                     | selects all nodes whose class contains container
li a                           | selects all a nodes inside li nodes
div#container > ul             | selects the ul child elements of the div with id container
a[href="http://jobbole.com"]   | selects a elements whose href is exactly http://jobbole.com
a[href*="jobbole"]             | selects a elements whose href contains jobbole
a[href^="http"]                | selects a elements whose href starts with http
a[href$=".jpg"]                | selects a elements whose href ends with .jpg
div:not(#container)            | selects all div elements whose id is not container
li:nth-child(3)                | selects the third li element
tr:nth-child(2n)               | selects the even-numbered tr elements
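
A minimal sketch of select() on a made-up fragment; the id, the links, and the jobbole.com URL below are only illustrative:

from bs4 import BeautifulSoup

html = '''
<div id="container">
  <ul>
    <li><a href="http://jobbole.com/post/1.html">post</a></li>
    <li><a href="http://example.com/pic.jpg">pic</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('div#container > ul'))  # ul children of the div with id container
print(soup.select('li a'))                # every <a> under an <li>
print(soup.select('a[href^="http"]'))     # href starting with http
print(soup.select('a[href$=".jpg"]'))     # href ending with .jpg
print(soup.select('li:nth-of-type(2)'))   # the second <li>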
Example: scraping a college ranking list from college.gaokao.com, first with bs4 selectors and (in the commented-out section) with lxml XPath.
# pip install lxml
from lxml import etree
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor


class CollegateRank(object):

    def get_page_data(self, url):
        """Download the list page, keep a local copy, then parse it."""
        response = self.send_request(url=url)
        if response:
            # Keep a local copy of the downloaded page on disk.
            with open('page.html', 'w', encoding='gbk') as file:
                file.write(response)
            self.parse_page_data(response)

    def parse_page_data(self, response):
        # Extract the data with bs4.
        soup = BeautifulSoup(response, 'lxml')
        # soup.find() returns a single node, soup.find_all() returns every match.
        # Equivalent alternatives:
        # ranks = soup.select('div.scores_List > dl')
        # ranks = soup.find_all(attrs={'class': 'scores_List'})[0].find_all('dl')
        ranks = soup.find_all(class_='scores_List')[0].find_all('dl')

        for dl in ranks:
            school_info = {}
            school_info['url'] = dl.select('dt a')[0].attrs['href']
            school_info['icon'] = dl.select('dt a')[0].select('img')[0].attrs['src']
            school_info['name'] = dl.select('dt > strong a')[0].text
            school_info['address'] = dl.select('dd > ul > li')[0].text
            # The school's features are several <span> tags joined with a Chinese comma.
            school_info['tese'] = '、'.join(span.text for span in dl.select('dd > ul > li')[1].select('span'))
            school_info['type'] = dl.select('dd > ul > li')[2].text
            school_info['belong'] = dl.select('dd > ul > li')[3].text
            school_info['level'] = dl.select('dd > ul > li')[4].text
            school_info['weburl'] = dl.select('dd > ul > li')[5].text
            # print(school_info)

            self.parse_school_detail(school_info['url'], school_info)

    def parse_school_detail(self, url, school_info):
        """Fetch a school's detail page and download the pages it links to in a thread pool."""
        response = self.send_request(url)
        soup = BeautifulSoup(response, 'lxml')
        rank = soup.find_all(class_='sm_nav bk')[0].find_all('p')
        pool = ThreadPoolExecutor(20)
        for p in rank:
            links = p.select('a')
            if not links:
                continue
            school_info = {}
            school_info['url'] = links[0].attrs['href']
            print(school_info)

            # Download each linked page in the pool; parse it in the callback.
            result = pool.submit(self.send_request, school_info['url'])
            result.add_done_callback(self.parse_school_info)

    def parse_school_info(self, future):
        text = future.result()
        print('parsed detail page, length:', len(text))
        # The list page could also be parsed with XPath:
        # etree_xpath = etree.HTML(response)
        # ranks = etree_xpath.xpath('//div[@class="scores_List"]/dl')
        # for dl in ranks:
        #     school_info = {}
        #     school_info['url'] = self.extract_first(dl.xpath('./dt/a[1]/@href'))
        #     school_info['icon'] = self.extract_first(dl.xpath('./dt/a[1]/img/@src'))
        #     school_info['name'] = self.extract_first(dl.xpath('./dt/strong/a/text()'))
        #     school_info['address'] = self.extract_first(dl.xpath('./dd/ul/li[1]/text()'))
        #     school_info['tese'] = '、'.join(dl.xpath('./dd/ul/li[2]/span/text()'))
        #     school_info['type'] = self.extract_first(dl.xpath('./dd/ul/li[3]/text()'))
        #     school_info['belong'] = self.extract_first(dl.xpath('./dd/ul/li[4]/text()'))
        #     school_info['level'] = self.extract_first(dl.xpath('./dd/ul/li[5]/text()'))
        #     school_info['weburl'] = self.extract_first(dl.xpath('./dd/ul/li[6]/text()'))
        #     print(school_info)

    def extract_first(self, data=None, default=None):
        """Return the first item of an XPath result list, or the default when it is empty."""
        if data:
            return data[0]
        return default

    def send_request(self, url, headers=None):
        headers = headers if headers else {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
        response = requests.get(url=url, headers=headers)
        if response.status_code == 200:
            return response.text


if __name__ == '__main__':
    url = 'http://college.gaokao.com/schlist/'
    obj = CollegateRank()
    obj.get_page_data(url)


