Scraping data with selenium without logging in

LYD666888999

Last modified 2023-02-23 19:28:16

Tags: chrome python frontend selenium crawler Powered by 金山文档

First published 2023-02-23 19:20:50

Article link: https://blog.csdn.net/LYD666888999/article/details/129188090

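This script uses the selenium library with the Chrome webdriver to interact with a web page. It enters a specific iframe and scrapes comment data: name, email, comment text, and time. The page source is parsed with lxml's etree module to speed up processing, and the records are then saved to a JSON file. The script also uses the schedule library to run the crawl automatically at 21:00 every day.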

Summary generated by CSDN using AI.
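
The summary mentions saving the scraped records to a JSON file. A minimal sketch of that step might look like the following, assuming each record is a plain dict. The `name` and `email` keys appear in the script; the `comment` and `time` keys, the helper names `save_comments` and `load_comments`, and the sample values are hypothetical and only illustrate the idea.

```python
import json


def save_comments(comments, path):
    """Write a list of comment dicts to `path` as UTF-8 JSON."""
    # ensure_ascii=False keeps non-ASCII text (e.g. Japanese names) readable
    # in the file instead of escaping it as \uXXXX sequences.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(comments, f, ensure_ascii=False, indent=2)


def load_comments(path):
    """Read the comment list back, mainly to check the round trip."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample record; real values would come from the scraped page.
sample = [
    {
        "name": "山田",
        "email": "user@example.com",
        "comment": "example comment",
        "time": "2023-02-23 19:20",
    }
]
```

Calling `save_comments(sample, "comments.json")` would write the records to a file in the working directory. Opening the file with an explicit `encoding="utf-8"` avoids depending on the platform default, which matters when the scraped text is not ASCII.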

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import json
from lxml import etree
import schedule

def crawlerData(driver):
    web_driver = driver
    # Switch into the iframe that holds the data
    # (switch_to_frame and find_element_by_xpath were removed in Selenium 4)
    driver.switch_to.frame(driver.find_element(By.XPATH, '//iframe[@title="オンラインストア"]'))
    EXITS = True
    count = 1
    while EXITS:
        try:
            print(count)
            tables = web_driver.page_source
            # Parse with lxml for faster parsing
            html = etree.HTML(tables, etree.HTMLParser())
            tables = html.xpath('//ul[@class="Polaris-ResourceList_r589e"]/li')
            for tr in tables:
                data = {}
                data["name"] = (tr.xpath('.//div[@class="ER3pl"]//text()'))[0]
                data["email"] = (tr.xpath('.