Python Web Scraping Lab 1: Analyzing and Scraping a Text-Based Anti-Scraping Website

Latest recommended article published 2024-01-04 14:22:25

Jin4869

Category: Python Web Scraping. Tags: python, web scraping, chrome

Article link: https://blog.csdn.net/Jin4869/article/details/128089921

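Lab 1: Analyzing and Scraping a Text-Based Anti-Scraping Website

Objective

Become familiar with using tools such as Selenium and Puppeteer to scrape the basic content of a website.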

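Environment

Selenium library
PyQuery library
BeautifulSoup (bs4) and NumPy, both of which are imported in the code of this lab
Chrome and the matching version of ChromeDriver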

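Basic requirement

For one page of https://antispider3.scrape.center/, save the information for each book in its own JSON file. Each file is named after the book title (title.json) and contains that book's information.

Procedure

Imports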

import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from bs4 import BeautifulSoup
import re
import numpy as np
import json
from pyquery import PyQuery as pq

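Launching ChromeDriver

Explicit wait: WebDriverWait()
- driver: the WebDriver instance
- timeout: the maximum time to wait, in seconds
- Within the timeout, the condition is checked at a fixed polling interval (0.5 s by default) and the result is returned as soon as the condition is met. If it is still not met when the timeout expires, a TimeoutException is raised. The ignored_exceptions parameter lists the exceptions to swallow while polling (NoSuchElementException by default)
- Combined with until(), it waits flexibly on a given condition
  - EC.presence_of_all_elements_located(): checks whether at least one element matching the locator is present in the page. If so, it returns a list of all matching elements; otherwise it returns an empty list, which until() treats as "not yet" and keeps polling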

HTML structure of each book

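browser = webdriver.Chrome()  # create the browser object
browser.get('https://antispider3.scrape.center/')

# Select all elements with class "item" via a CSS selector;
# while none is found, keep waiting (up to 10 seconds)

WebDriverWait(browser, 10) \
    .until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.item')))
html = browser.page_source  # HTML source of the page after it has rendered
browser.quit()  # close the browser once the source has been captured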

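PyQuery usage:

The PyQuery object gives access to nodes in the HTML string, for example the title node.
It also repairs incomplete HTML documents while parsing.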

doc = pq(html)

# print(html)

Define lists for the titles, authors, and cover URLs

titles = []
authors = []
urls = []

Create the BeautifulSoup object

soup = BeautifulSoup(html, "html.parser")  # parse with Python's built-in HTML parser

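Scraping the book information

soup.find_all() searches all descendant tags of the current tag and returns those that match the filter
.text gives the text of a node

Observe the characteristics of the book nodes

class="name whole": the characters of the title appear in the correct order
class="m-b-sm name": the characters of the title are shuffled and need further processing with get_title()

Function for extracting the title

re library
- re.findall() returns a list containing all matches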
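
As a quick illustration of re.findall(), the sketch below applies the two patterns used later in this lab to made-up strings. The style string is a hypothetical example of an inline CSS offset, not copied from the site.

```python
import re

# Hypothetical inline style, similar in shape to a CSS offset declaration.
style = "left: 32px; top: 0px"

numbers = re.findall(r"\d+", style)       # every run of digits, as strings
first_offset = int(numbers[0])            # the first match, converted to int

words = re.findall(r"\S+", "  Python  ")  # runs of non-whitespace characters
blank = re.findall(r"\S+", "   ")         # no match gives an empty list

print(numbers, first_offset, words, blank)
```

Because a string with no match yields an empty list, indexing the result with [0] is only safe after checking that the list is non-empty.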

numpy.argsort() returns the indices that would sort the array in ascending order

Example:

>>> import numpy as np
>>> x = np.array([1, 4, 3, -1, 5, 9])
>>> x.argsort()
array([3, 0, 2, 1, 4, 5])
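
For readers who want to see what argsort computes without NumPy, the following is a small pure-Python equivalent. The helper name argsort is chosen here for illustration; it is a sketch of the idea, not NumPy's implementation.

```python
def argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


x = [1, 4, 3, -1, 5, 9]
order = argsort(x)
print(order)                  # [3, 0, 2, 1, 4, 5]
print([x[i] for i in order])  # [-1, 1, 3, 4, 5, 9]
```

Reading the values in the order given by the indices yields the sorted sequence, which is exactly how the shuffled title characters are put back in order.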

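Reassemble the title from the offsets

def get_title(tag: bs4.Tag) -> str:
    tokens = []
    tokens_pos = []
    for span in tag.children:
        # Skip whitespace and comment nodes, which carry no style attribute
        if not isinstance(span, bs4.Tag):
            continue
        # Take the first run of non-whitespace text; an empty span becomes a space
        found = re.findall(r"\S+", span.text)
        token = found[0] if found else " "
        tokens.append(token)
        # r"\d+" matches the numeric part of the style, i.e. the offset
        token_pos = int(re.findall(r"\d+", span["style"])[0])
        tokens_pos.append(token_pos)
    # argsort() returns the indices that sort the offsets in ascending order
    idxs = np.array(tokens_pos).argsort()
    return "".join(tokens[idx] for idx in idxs)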

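# Scrape the titles
h3s = soup.find_all("h3")

for h3 in h3s:  # handle each h3 according to its class list
    title = ""
    if h3["class"] == ["name", "whole"]:
        title = h3.text  # characters are already in order
    elif h3["class"] == ["m-b-sm", "name"]:
        title = get_title(h3)  # characters are shuffled; reorder by offset
    titles.append(title)
    print(title)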
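
The reordering idea can also be tried offline. The sketch below uses only the standard library and a made-up HTML fragment: the markup, the class names inside it, and the pixel offsets are assumptions for illustration, and the regular expression is used only to keep the demo self-contained (the lab itself parses with BeautifulSoup).

```python
import re

# Hypothetical markup: characters appear shuffled in the source, and the
# inline `left` offset decides where each one is drawn on screen.
html = (
    '<h3 class="m-b-sm name">'
    '<span class="char" style="left: 80px;">n</span>'
    '<span class="char" style="left: 0px;">P</span>'
    '<span class="char" style="left: 48px;">h</span>'
    '<span class="char" style="left: 16px;">y</span>'
    '<span class="char" style="left: 64px;">o</span>'
    '<span class="char" style="left: 32px;">t</span>'
    '</h3>'
)

# Capture (offset, character) pairs from each span.
pairs = re.findall(r'<span[^>]*left:\s*(\d+)px[^>]*>([^<]*)</span>', html)

# Sort numerically by offset, then join the characters in that order.
ordered = sorted(pairs, key=lambda pair: int(pair[0]))
title = "".join(char for _, char in ordered)

print(title)
```

Sorting on the integer value matters: sorting the offsets as strings would place "16" before "8".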

Scraping the book cover image URLs

Filter for img tags whose class is cover
tag["src"] reads the value of the src attribute

# Scrape the cover URLs
tag_img = soup.find_all("img", {"class": "cover"})
for tag in tag_img:
    url = tag["src"]
    urls.append(url)

Scraping the author names

author.text() returns the text content of the node

# Scrape the author names; .authors is a descendant of .item
authors1 = doc('.item .authors')
for author in authors1.items():
    authors.append(author.text())

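Saving the JSON files

The file path is built with str.format(), which substitutes the index into the {} placeholder. The r prefix only marks a raw string; no regular-expression matching is involved.
json.dump() serializes a dict to JSON text and writes it to the open file.
Output format: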

# Save each book to its own JSON file
for i in range(len(titles)):
    # Build a fresh dict for every book
    book_dict = {
        "title": titles[i],
        "cover_url": urls[i],
        "authors": authors[i],
    }
    # Add the .json extension so the output is recognized as JSON
    with open(r"./book_{}.json".format(i), "w", encoding="utf-8") as fp:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(book_dict, fp, ensure_ascii=False, indent=2)
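
The basic requirement asks for files named after the book title. One possible way to do that is sketched below; safe_filename and save_books are hypothetical helpers introduced here, the sample data is made up, and the set of replaced characters is an assumption about what common file systems reject.

```python
import json
import re
import tempfile
from pathlib import Path


def safe_filename(title: str) -> str:
    """Replace characters that are not allowed in common file systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or "untitled"


def save_books(titles, urls, authors, out_dir):
    """Write one <title>.json file per book and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for title, url, author in zip(titles, urls, authors):
        book = {"title": title, "cover_url": url, "authors": author}
        path = out_dir / (safe_filename(title) + ".json")
        with open(path, "w", encoding="utf-8") as fp:
            json.dump(book, fp, ensure_ascii=False, indent=2)
        paths.append(path)
    return paths


# Made-up sample data, only to exercise the functions.
demo_dir = tempfile.mkdtemp()
written = save_books(
    ["A/B Testing", "Python"],
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    ["Author One", "Author Two"],
    demo_dir,
)
print([p.name for p in written])
```

Note that two books with the same title would overwrite each other in this sketch; appending the index to the file name would avoid that.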