Scraping Selected Jianshu Articles with Selenium and chromedriver

Latest recommended article published on 2022-07-02 12:13:45

随着风儿去流浪


Categories: PythonWeb, python爬虫 (Python scraping). Tags: xpath, chrome, selenium, python

Article link: https://blog.csdn.net/weixin_45915507/article/details/115015301


Requirements analysis

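The goal is to scrape the title and abstract of each article listed on the Jianshu home page, plus the full article text from the detail page that opens when a title is clicked.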

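Viewing the page source shows that the article list on the home page does not appear in the raw source, so it is most likely loaded dynamically via AJAX. The detail page is different: the content we want is present directly in its page source, so XPath alone is enough to extract it.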
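
As a rough illustration of that XPath step, here is a minimal sketch run against a hypothetical, simplified stand-in for a detail page. It assumes well-formed markup and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so that it runs without extra installs. The script in this post uses lxml instead, and the class name `_2rhmJa` is taken from that script and may change on the live site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a detail page; the real markup is
# larger and its class names (such as _2rhmJa) may change at any time.
SAMPLE_PAGE = """
<html>
  <body>
    <article class="_2rhmJa">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </article>
    <div class="sidebar"><p>Unrelated text.</p></div>
  </body>
</html>
"""


def extract_paragraphs(page_source):
    """Return the text of every <p> that is a direct child of the article element."""
    root = ET.fromstring(page_source.strip())
    # ElementTree has no /text() step, so the text is read from .text instead.
    nodes = root.findall(".//article[@class='_2rhmJa']/p")
    return [node.text for node in nodes if node.text]


print(extract_paragraphs(SAMPLE_PAGE))
```

The sidebar paragraph is skipped because the expression only matches `<p>` elements directly under the article element.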

Choosing tools

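Since I only need a handful of articles, my first thought was to use requests, but I have not yet worked out how to get past the site's anti-scraping measures, so I fell back on Selenium driving Google Chrome. This requires first downloading a WebDriver for Chrome (chromedriver).
Installing Selenium and chromedriver:

Install Selenium: Selenium has bindings for many languages, including Java, Ruby, and Python. The Python version is all we need.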

pip install selenium

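Install chromedriver: download the build that matches your Chrome version, then place it in a directory whose path contains only English (ASCII) characters and does not require elevated permissions.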
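
Before launching the browser, it can help to check that the driver is where you expect it. The following is a small sketch using only the standard library. The helper names `is_ascii_path` and `locate_chromedriver` are made up for illustration, and the check only covers the file's existence and an ASCII-only path, not permissions or version matching.

```python
import os
import shutil


def is_ascii_path(path):
    """True if every character of the path is plain ASCII."""
    return path.isascii()


def locate_chromedriver(explicit_path=None):
    """Return a usable chromedriver path, or None if none is found.

    explicit_path is a hypothetical user-supplied location; when it is
    omitted, the system PATH is searched instead.
    """
    candidate = explicit_path or shutil.which("chromedriver")
    if candidate and os.path.isfile(candidate) and is_ascii_path(candidate):
        return candidate
    return None


# Prints the detected path, or None when chromedriver is not on PATH.
print(locate_chromedriver())
```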

Code

# -*- coding: utf-8 -*-
# @Time    : 2021/3/20 19:41
# @Author  : JAKE4545
# Scrape part of Jianshu's content with Selenium
import json
import time

from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Path to the chromedriver executable (replace the placeholder with your own)
driver_path = 'your_path'


def start_page():
    # Initialize a driver and point it at chromedriver.
    # Selenium 4 takes the path via Service; executable_path= was deprecated and later removed.
    driver = webdriver.Chrome(service=Service(driver_path))
    url = 'https://www.jianshu.com/'
    driver.get(url)
    time.sleep(5)  # give the dynamically loaded article list time to render
    start_page_html = etree.HTML(driver.page_source)
    titles = start_page_html.xpath("//a[@class='title']/text()")
    # title_urls = start_page_html.xpath("//a[@class='title']/@href")
    contents = start_page_html.xpath("//p[@class='abstract']/text()")
    # Same XPath as titles, so the clickable links and the titles line up by index
    input_tags = driver.find_elements(By.XPATH, "//a[@class='title']")
    try:
        for i, input_tag in enumerate(input_tags):
            input_tag.click()  # the article is expected to open in a new tab
            item = {}
            item['title'] = titles[i]
            item['content'] = contents[i]
            item['article'] = detail_page(driver)
            print(item)
            save_to_json(item)
    except Exception as e:
        print('------------An error occurred:----------')
        print(e)
    finally:
        driver.quit()
    print('Scraping finished')


def detail_page(driver):
    # Switch to the newly opened tab, read the article, then return to the home page
    driver.switch_to.window(driver.window_handles[1])
    time.sleep(5)
    htm = etree.HTML(driver.page_source)
    article = htm.xpath("//article[@class='_2rhmJa']/p/text()")
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    return article


def save_to_json(item):
    # Append one JSON object per line; UTF-8 keeps characters that GBK would drop
    with open('jianshu.txt', 'a', encoding='utf-8') as fp:
        fp.write(json.dumps(item, ensure_ascii=False))
        fp.write('\n')


if __name__ == '__main__':
    start_page()
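
If the scraped records are to be loaded again later, one option is to store one JSON object per line and read the file back line by line. The sketch below shows that round trip with the standard library only. The helper names `append_record` and `load_records`, the file name `jianshu.jsonl`, and the sample record are all assumptions made for illustration, not part of the script above.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read back every non-empty line as one JSON object."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Hypothetical record shaped like the ones the scraper builds.
sample = {"title": "示例标题", "content": "摘要", "article": ["第一段", "第二段"]}

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "jianshu.jsonl")
    append_record(out_path, sample)
    append_record(out_path, sample)
    loaded = load_records(out_path)

print(len(loaded), loaded[0]["title"])
```

Writing in append mode means repeated runs keep adding to the same file, so duplicates may need to be filtered when loading.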

Results and summary

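Websites change quickly, so rather than copying this code and running it as-is, it is better to understand the approach and adapt it to your own case.
The data collected is enough to move on to the next stage of the project.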
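I am currently learning the Django framework by building a blog-like system, but had no suitable article data to fill the demo pages, so I scraped a small amount of data from Jianshu (with Selenium, since requests alone was blocked). The code and content are for learning purposes only.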