Python web scraping with BeautifulSoup (scraping the Maoyan TOP 100 and China's best-universities ranking)

A BeautifulSoup-based crawler


What is BeautifulSoup?

BeautifulSoup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it provides simple, commonly used operations for navigating, searching, and modifying that tree. With it, we no longer need to write regular expressions to extract information from a web page.

Just as Java crawlers commonly pair HttpClient with Jsoup, in Python we can pair requests with BeautifulSoup.

BeautifulSoup official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

First we need to install BeautifulSoup; it also needs a parser to work with, and here we use lxml:

pip install bs4   # install BeautifulSoup
pip install lxml  # install the lxml parser
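After installing, a one-line check (my own snippet, not from the original post) confirms that bs4 can find the lxml parser:

```python
from bs4 import BeautifulSoup

# If lxml is missing, BeautifulSoup raises bs4.FeatureNotFound here
soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.string)  # hello
```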

Let's start by parsing the HTML snippet from the official docs to get a feel for the API; the specific methods are shown in the code and its comments below.
demo1

"""
Using BeautifulSoup
"""

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Instantiate a BeautifulSoup object
soup = BeautifulSoup(html_doc, "lxml")
# Pretty-print (output formatted as indented HTML)
# print(soup)
# print(soup.prettify())

# Get a tag
# tag = soup.title
# name = tag.name
# str = tag.string
# print(tag)
# print(name)
# print(str)

# tag = soup.p
# print(tag)

# Get all the <p> tags matching the given attributes
# tags = soup.find_all("p", attrs={"class": "story"})
# print(len(tags))

# Find the first tag with class="title" (class_ avoids the Python keyword)
# tag = soup.find(class_="title")
# print(tag)

# Get the name of a tag's parent
# tag = soup.title
# print(tag.parent.name)


# Get the value of an attribute
# tag = soup.p
# str = tag.get("class")
# print(str)

# str = soup.p["class"]
# print(str)

# str = soup.a.get("id")
# print(str)


# Get all attribute values of the <a> tag
# attrs = soup.a.attrs
# print(attrs)

# Iterate over all descendants of <body>
# tag = soup.body
# for c in tag.descendants:
#     print(c)
#     print("*"*30)

# Walk up through all parents of <title>
# for p in soup.title.parents:
#     print(p.name)


# Siblings: first grab the first <a> tag, then step through its siblings

# tag = soup.a
# print(tag.next_sibling.next_sibling)


By parsing with BeautifulSoup we skip the step of writing regular expressions to filter content; we can fetch what we want directly through method calls, which is considerably more convenient than regex.
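To make that comparison concrete, here is a small sketch (my own example, not from the original post) that extracts the same hrefs once with a regex and once with BeautifulSoup:

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://example.com/elsie" class="sister">Elsie</a> '
        '<a href="http://example.com/lacie" class="sister">Lacie</a>')

# Regex: works on this exact string, but breaks easily if attribute
# order, quoting, or whitespace changes
hrefs_re = re.findall(r'href="([^"]+)"', html)

# BeautifulSoup: reads the document structure instead of the raw text
soup = BeautifulSoup(html, "lxml")
hrefs_bs = [a.get("href") for a in soup.find_all("a")]

print(hrefs_re == hrefs_bs)  # True
```

Both give the same result here; the difference is that the BeautifulSoup version keeps working when the markup shifts around.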

demo2
Scrape the poster images of the Maoyan TOP 100 movies and save them into folders

The approach is the same as before: whatever technique or tool we use, we first analyze the information we want to scrape, then filter out what we need with the appropriate methods, and finally download it.

"""
Scrape the Maoyan movie leaderboard
"""

import requests
from bs4 import BeautifulSoup
import os

headers = {
    "User-Agent": "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"
}

# Get the current working directory (our root directory)
getcwd = os.getcwd()

for j in range(10):
    # Go back to the root directory
    os.chdir(getcwd)

    # Create a folder "第{j+1}页" ("page j+1") under the root directory;
    # exist_ok avoids an error if the folder is already there
    os.makedirs(f"第{j+1}页", exist_ok=True)

    # Change into that folder
    os.chdir(f"第{j+1}页")

    response = requests.get(f"https://maoyan.com/board/4?offset={j*10}", headers=headers)

    if response.status_code == 200:
        # Parse the page
        soup = BeautifulSoup(response.text, "lxml")
        imgTags = soup.find_all("img", attrs={"class": "board-img"})
        for imgTag in imgTags:
            name = imgTag.get("alt")
            src = imgTag.get("data-src")
            resp = requests.get(src, headers=headers)
            with open(f"{name}.png", "wb") as f:
                f.write(resp.content)
            print(f"{name} {src} saved")

Final result

demo3
Scrape the ranking of China's best universities

"""
Scrape the rankings from the "Best Universities Network" (zuihaodaxue.com)
"""

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14"
}
response = requests.get("http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html", headers=headers)
response.encoding = "utf-8"
if response.status_code == 200:
    soup = BeautifulSoup(response.text, "lxml")
    trTags = soup.find_all("tr", attrs={"class": "alt"})
    for trTag in trTags:
        # Each ranking row holds: rank, name, province, score
        rank = trTag.contents[0].string
        name = trTag.contents[1].string
        addr = trTag.contents[2].string
        sco = trTag.contents[3].string
        print(f"{rank} {name} {addr} {sco}")

Final result
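As a side note, the same rows can also be grabbed with CSS selectors via select() instead of find_all(). A minimal sketch on a hard-coded fragment (the row structure is my assumption of what the live table looks like):

```python
from bs4 import BeautifulSoup

# A hard-coded fragment mimicking one ranking row (assumed structure)
html = """
<table>
  <tr class="alt"><td>1</td><td>清华大学</td><td>北京</td><td>94.6</td></tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")
for tr in soup.select("tr.alt"):              # CSS selector instead of find_all
    cells = [td.string for td in tr.select("td")]
    print(" ".join(cells))                    # prints: 1 清华大学 北京 94.6
```

select("tr.alt") is equivalent to find_all("tr", attrs={"class": "alt"}), and indexing the td cells avoids relying on the exact positions in .contents, which can include whitespace text nodes on real pages.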
