selenium-WeChat Official Account Scraping-0.1

噜啦噜啦嘞113

Last modified 2022-03-16 13:15:10

Category: Web Scraping. Tags: WeChat, selenium, python

First published 2022-03-15 02:24:21

Article link: https://blog.csdn.net/qq_45209288/article/details/123492720

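Requirements

A colleague of mine has recently been pursuing a woman who is a teacher and probably needs to read articles regularly for work. Wanting to cater to her interests, he asked me to write a script that scrapes articles from a few well-regarded schools. The plan is to build a web page later where searching for a keyword shows the related articles.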
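
The keyword lookup at the end of that plan can be sketched in a few lines. This is only an illustration, not the author's implementation: the function name search_articles and the sample records are made up, and matching on the title alone is an assumption.

```python
def search_articles(records, keyword):
    """Return the (url, title) records whose title contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [(url, title) for url, title in records if needle in title.casefold()]


# Made-up sample records standing in for scraped (url, title) pairs
records = [
    ("http://example.com/1.html", "Reading Week opens in the library"),
    ("http://example.com/2.html", "Autumn sports day results"),
    ("http://example.com/3.html", "Teachers share READING lists"),
]

matches = search_articles(records, "reading")
for url, title in matches:
    print(title, url)
```

A web page would call the same function with the text from its search box and render the returned pairs as links.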

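Process 1

I started with the official websites of a few well-regarded schools. They had essentially no anti-scraping measures, so scraping them was quick.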


import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://sysyx.com.cn/"  # base site, used to resolve relative links
PAGE_COUNT = 15  # number of list pages to fetch

for index in range(1, PAGE_COUNT + 1):
    # Placeholder: the actual list-page URL is withheld
    school_url = "<site URL withheld>"
    print(index)

    # Fetch the HTML of the given URL
    response = urllib.request.urlopen(school_url)
    content = response.read().decode('utf-8')
    # Parse the HTML; content is already a str, so from_encoding is not needed
    soup = BeautifulSoup(content, 'html.parser')
    # Find the div tags with class="listbox"
    divs = soup.find_all('div', {'class': "listbox"})
    # Take the li tags inside the first matching div
    lis = divs[0].find_all('li')

    # div_page = soup.find_all('div', {'class': "page"})
    # last_num = div_page[0].find_all('a')[-1].get("href")
    # print(last_num)
    with open('urlList.txt', 'a+', encoding='utf8') as fp:
        for li in lis:
            links = li.find_all('a')
            # Use the first <a>; if it has no href, fall back to the second
            link = links[0] if links[0].get("href") is not None else links[1]
            url2 = urljoin(BASE_URL, link.get("href"))
            title = link.text
            # Print and save the joined URL and the corresponding news title
            print(title)
            print(url2)
            fp.write(url2 + "," + title + '\n')

    #    if title is None:
    #        title = li.find_all('')[0].text
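
One caveat with writing url + "," + title by hand: a title that itself contains a comma makes the line ambiguous to split later. A minimal sketch of a safer alternative with the standard csv module follows. The helper name save_links and the sample rows are illustrative assumptions, not part of the original script.

```python
import csv
import io


def save_links(rows, fp):
    """Write (url, title) pairs as CSV, quoting fields that contain commas."""
    writer = csv.writer(fp)
    for url, title in rows:
        writer.writerow([url, title])


# Made-up sample data; an in-memory buffer stands in for urlList.txt
rows = [("http://example.com/a.html", "Notice, spring term"),
        ("http://example.com/b.html", "Sports day")]
buffer = io.StringIO()
save_links(rows, buffer)

# Reading the data back recovers the titles intact, commas included
buffer.seek(0)
restored = [tuple(row) for row in csv.reader(buffer)]
print(restored == rows)
```

With a real file, open it with newline='' as the csv documentation recommends, for example open('urlList.txt', 'a', encoding='utf8', newline='').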

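However, this approach needs roughly one script per school, so the code is barely reusable. More importantly, many schools no longer update the articles on their official sites, while nearly all of them still post on their WeChat Official Accounts. So...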
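
One way to ease the one-script-per-school problem is to keep the per-school differences in a small config object and share the parsing code. The sketch below is an illustration under assumptions: SchoolConfig, ListParser and extract_links are hypothetical names, the page layout (a div with a known class wrapping li items) is taken from the script above, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without third-party packages.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class SchoolConfig:
    """Hypothetical per-school settings; the field names are assumptions."""
    base_url: str    # used to resolve relative links
    list_class: str  # class of the <div> that wraps the article list


class ListParser(HTMLParser):
    """Collects (url, title) for the first <a> with an href in each <li> of the list div."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.div_depth = 0        # > 0 while inside the list div
        self.in_li = False
        self.li_done = False      # True once this <li> has produced a link
        self.current_href = None  # href of the <a> currently being read
        self.title_parts = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.div_depth or self.config.list_class in classes:
                self.div_depth += 1
        elif self.div_depth and tag == "li":
            self.in_li, self.li_done = True, False
        elif self.in_li and tag == "a" and not self.li_done and attrs.get("href"):
            self.current_href = attrs["href"]
            self.title_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "li":
            self.in_li = False
        elif tag == "a" and self.current_href is not None:
            url = urljoin(self.config.base_url, self.current_href)
            self.results.append((url, "".join(self.title_parts).strip()))
            self.current_href, self.li_done = None, True


def extract_links(html, config):
    """Parse one list page and return its (url, title) pairs."""
    parser = ListParser(config)
    parser.feed(html)
    return parser.results


# Made-up HTML fragment and config for demonstration
sample_html = (
    '<div class="listbox"><ul>'
    '<li><a>tag</a><a href="/news/1.html">First</a></li>'
    '<li><a href="http://x.org/2.html">Second</a><a href="/skip">no</a></li>'
    '</ul></div>'
    '<div class="other"><li><a href="/out">Out</a></li></div>'
)
config = SchoolConfig(base_url="http://example.com/", list_class="listbox")
links = extract_links(sample_html, config)
print(links)
```

Supporting another school with a similar layout would then mean adding one more SchoolConfig rather than copying the script. Sites with a very different structure would still need their own parsing logic.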

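Process 2

The code for scraping Official Accounts is simple; the hard part is finding a way in. After thinking about it for a long time, I remembered seeing Official Accounts on the WeChat Official Accounts Platform back when I was writing a mini program, so I went straight there without hesitation:
[screenshot]
My account had been banned for going unused too long... I got another WeChat account from my colleague and used it to register on the platform.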

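Analysis

[screenshot]

[screenshot]
OK, the steps shown above are what selenium has to do. As the script in the next section does: enter the platform from its home page, open the article editor in a new window, open the editor's insert-link dialog, search for the target Official Account by name, then read the article list it returns page by page.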

Implementation

# -*- coding: utf-8 -*-
# Note: the find_element_by_* helpers and the executable_path argument used here
# were removed in later Selenium 4 releases, which use find_element(By.ID, ...)
# and webdriver.Chrome(service=Service(driver_path)) instead.
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

# Shared XPath prefix of the form inside the insert-link dialog
FORM = '//*[@id="vue_app"]/div[2]/div[1]/div/div[2]/div[2]/form[1]'
SEARCH_INPUT = FORM + '/div[4]/div/div/div/div/div[1]/span/input'
PAGER = FORM + '/div[5]/div/div/div[3]/span[1]'


def print_articles(driver):
    # Print the title, date and link of every article on the current page
    for item in driver.find_elements_by_class_name('inner_link_article_item'):
        spans = item.find_element_by_class_name('inner_link_article_title').find_elements_by_tag_name('span')
        title = spans[1].get_attribute('textContent')
        date = item.find_element_by_class_name('inner_link_article_date').get_attribute('textContent')
        href = item.find_element_by_tag_name('a').get_attribute('href')
        print("title:", title)
        print('date:', date)
        print("href:", href)


# Create the browser object from an explicit chromedriver path
driver_path = r'C:\Program Files\Google\Chrome\Application\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path)
time.sleep(2)
# get() blocks until the page has finished loading
driver.get('https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=1280197235')
href1 = driver.find_element_by_id('jumpUrl')
ActionChains(driver).click(href1).perform()
time.sleep(10)
href2 = driver.find_element_by_css_selector('#app > div.main_bd_new > div:nth-child(4) > div.weui-desktop-panel__bd > div > div:nth-child(1)')
ActionChains(driver).click(href2).perform()
# The click opens a new window; switch to the newest one
driver.switch_to.window(driver.window_handles[-1])
time.sleep(2)
# Click the editor's insert-link button
btn1 = driver.find_element_by_id('js_editor_insertlink')
ActionChains(driver).click(btn1).perform()
time.sleep(1)
# Click the button in the dialog that brings up the account search box
btn2 = driver.find_element_by_xpath(FORM + '/div[4]/div/div/p/div/button')
ActionChains(driver).click(btn2).perform()
time.sleep(1)
# Type the Official Account name and press Enter to search
driver.find_element_by_xpath(SEARCH_INPUT).send_keys("Official Account name")
time.sleep(1)
# driver.find_element_by_xpath(FORM + '/div[4]/div/div/div/div/div[1]/span/span/button').click()
driver.find_element_by_xpath(SEARCH_INPUT).send_keys(Keys.ENTER)
time.sleep(3)
# Pick the first account in the search results
driver.find_element_by_css_selector('#vue_app > div.weui-desktop-link-dialog > div.weui-desktop-dialog__wrp > div > div.weui-desktop-dialog__bd > div.link_dialog_panel > form:nth-child(1) > div:nth-child(4) > div > div > div > div.weui-desktop-search__panel > ul > li:nth-child(1) > div.weui-desktop-vm_primary').click()
time.sleep(3)
# Read the total page count, then print the first page
pages = int(driver.find_element_by_xpath(PAGER + '/span/label[2]').get_attribute('textContent'))
print_articles(driver)

for x in range(pages - 1):
    # The first page has a single pager link; on later pages the link to click is the second one
    xpth_txt = PAGER + ('/a[2]' if x > 0 else '/a')
    driver.find_element_by_xpath(xpth_txt).click()
    time.sleep(3)
    print_articles(driver)
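
The script above paces itself with fixed time.sleep calls, which either waste time or fail when the page is slow. Selenium ships WebDriverWait for this purpose. The standard-library sketch below only illustrates the polling idea behind such a wait; the helper name wait_until and its default values are assumptions made for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Demonstration with a stand-in condition that succeeds on the third call
calls = []


def ready():
    calls.append(1)
    return "ok" if len(calls) >= 3 else None


value = wait_until(ready, timeout=2, poll=0.01)
print(value, len(calls))
```

In the script, a call such as wait_until(lambda: driver.find_elements_by_class_name('inner_link_article_item')) could stand in for a fixed sleep before reading the list, since find_elements returns an empty list until matching items exist.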

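Results

Getting title_list nearly drove me crazy. I had left the s off "elements" in driver.find_elements_by_class_name, and the singular form returns one element rather than a list, so the loop over it fails. It took me more than half an hour to find the cause. That is what happens after not writing this kind of code for so long...
The code is very rough and still unfinished; I will optimize and extend it when I have time.
In short, as I said before, the implementation is not hard. The key is finding a route to the data you want to scrape.

Quote of the post: You only live once, so why settle for second best! I can do it! Damn, time to sleep.

...This post failed review a hundred times.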
