Advanced Scraping: Crawling GitHub 2.0 with selenium

Latest recommended article published on 2024-04-18 04:37:37

utopianist

Article link: https://blog.csdn.net/qq_35193302/article/details/83993395

Keywords: selenium, web scraping

GitHub:https://github.com/utopianist/GitHubZip2.0

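Preface

The first post I wrote for my WeChat official account was "Let's Scrape: Downloading GitHub Project Zip Files, Saving and Unzipping Them".

Barely ten days after it was published, GitHub changed its URL rules.

Yesterday I updated that part of the code:

GitHub: https://github.com/utopianist/GitHub_ZIP

So why is today's article called "Advanced Scraping: Crawling GitHub 2.0 with selenium"?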

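The motivation is something I noticed: for people who have uploaded a large number of projects and contributed a lot to the community, GitHub changes the URL rules of their Repositories pages. For example:

Cui Qingcai (崔庆才): https://github.com/Germey?tab=repositories

Liao Xuefeng (廖雪峰): https://github.com/michaelliao?tab=repositories

GitHub may have encrypted the pagination parameter in these URLs, and that is exactly why we bring out the heavy artillery: selenium.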
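
One way to see why the URL of the next page cannot simply be constructed is to look at what the listing URLs above actually carry in their query strings. The sketch below uses only the Python standard library; the helper name query_params is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Return the query-string parameters of a URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


for url in (
    "https://github.com/Germey?tab=repositories",
    "https://github.com/michaelliao?tab=repositories",
):
    # Each URL carries a single parameter: tab=repositories
    print(url, query_params(url))
```

Both URLs expose only the tab parameter, so there is no visible page number to increment.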

selenium

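When the rule behind a URL's parameters cannot be worked out, selenium lets us simulate what a user normally does in a browser.

If I cannot find the URL of the second page, why not just click the "second page" button on the page?

Once we reach the second page, a new question comes up: how do we find out the URL of that second page?

To solve this we use driver.current_url from the selenium library, which returns the URL of the current page.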
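
The click-then-read idea can be sketched roughly as follows. This is a minimal illustration, not the author's code: the names url_after and selenium_click are hypothetical, and the selenium imports are placed inside the function so that url_after can be tried with any object that has a current_url attribute.

```python
def url_after(driver, action):
    """Run action(driver) and return the new URL, or None if it did not change."""
    before = driver.current_url
    action(driver)
    after = driver.current_url
    return after if after != before else None


def selenium_click(css_selector, timeout=10):
    """Build an action that clicks css_selector and waits for the URL to change."""

    def action(driver):
        # Imported lazily: only needed when a real browser is driven
        from selenium.common.exceptions import TimeoutException
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import WebDriverWait

        before = driver.current_url
        try:
            button = WebDriverWait(driver, timeout).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, css_selector))
            )
            button.click()
            # Wait until the address bar no longer shows the old URL
            WebDriverWait(driver, timeout).until(EC.url_changes(before))
        except TimeoutException:
            pass  # URL stays unchanged, so url_after reports None

    return action
```

With a real browser this would be called as url_after(browser, selenium_click(selector)), where selector is whatever CSS selector matches the next-page link on the page being crawled.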

Documentation (the community-maintained Selenium with Python guide):

https://selenium-python.readthedocs.io/index.html

Pagination function

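from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()

# CSS selector of the pagination link that paging() clicks
NEXT_PAGE_SELECTOR = "#js-pjax-container > div > div.col-9.float-left.pl-2 > div.position-relative > div.paginate-container > div > a:nth-child(2)"

def paging(url):
    """Open url, click the pagination link, and return the resulting URL."""
    try:
        browser.get(url)
        # Wait up to 10 seconds for the pagination link to become clickable
        next_link = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, NEXT_PAGE_SELECTOR))
        )
        next_link.click()
        # Assumed completion: the original listing breaks off after the click
        return browser.current_url
    except TimeoutException:
        # Assumed handling: no clickable pagination link appeared in time
        return None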
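
To walk through every page of a repository list, a click-through helper can be called repeatedly, feeding each returned URL back in. The sketch below assumes a hypothetical callable named step that takes a page URL, clicks through to the next page, and returns that page's URL, or None when there is no further page. The function name collect_page_urls and the max_pages guard are illustrative choices as well.

```python
def collect_page_urls(first_url, step, max_pages=50):
    """Follow step from first_url and return every page URL visited, in order.

    step(url) must return the URL of the following page, or None when
    there is no further page. max_pages and the seen set guard against
    endless loops.
    """
    urls = []
    seen = set()
    url = first_url
    while url is not None and url not in seen and len(urls) < max_pages:
        urls.append(url)
        seen.add(url)
        url = step(url)
    return urls
```

Because step is passed in as a parameter, the loop can be checked against a plain dictionary of fake pages before a real browser is involved.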
