python(pycharm) 爬取微博内容
通过关键字,爬取微博的内容、发布时间、链接等信息
例如:(文本有些折叠了)
关键词:台风
代码如下:
from selenium import webdriver
from lxml import etree
from urllib import parse
from time import sleep
import datetime
from xlutils.copy import copy
import xlrd
import time
keyword = '台风' # search keyword to scrape (here: "typhoon")
y = 2020 # start year
m = 3 # start month
d = 10 # start day
days = 20 # number of consecutive days to scrape
url_keyword = parse.quote(keyword) # percent-encode the keyword so it is safe to embed in the URL
def getday(y, m, d, n):
    """Return the date n days after (y, m, d) as a 'YYYY-MM-DD' string.

    Args:
        y: year of the base date.
        m: month of the base date.
        d: day of the base date.
        n: offset in days; may be negative to step backwards.

    Returns:
        The shifted date formatted as 'YYYY-MM-DD'.
    """
    the_date = datetime.datetime(y, m, d)
    result_date = the_date + datetime.timedelta(days=n)
    # Return directly instead of reassigning the parameter `d` as the
    # original did -- shadowing a parameter obscures intent.
    return result_date.strftime('%Y-%m-%d')
def p(days, x):
    """Scrape Weibo search results for the global `keyword`, hour by hour.

    Starting from the global date (y, m, d), iterates `days` days; for each
    day requests the 24 one-hour search windows, parses poster name, post
    text, timestamp, source and link out of each result page, and appends
    them to wb.xls.

    Args:
        days: number of consecutive days to scrape.
        x: first spreadsheet row index to write to.
    """
    # Launch ONE browser for the whole run -- starting/quitting Chrome per
    # request was the dominant cost in the original version.
    bro = webdriver.Chrome(executable_path=r'D:\python\chorm\chromedriver.exe')
    try:
        for i in range(days):
            day = getday(y, m, d, i)
            for j in range(24):  # one URL per one-hour window
                if j == 23:
                    # The last window of day i ends at hour 0 of day i+1.
                    # (The original used getday(y, m, d, -(i - 1)), which is
                    # only correct on the first day.)
                    window = day + '-23:' + getday(y, m, d, i + 1) + '-0'
                else:
                    window = day + '-' + str(j) + ':' + day + '-' + str(j + 1)
                # NOTE: the original URL contained '×cope' -- the '&times'
                # of '&timescope' had been swallowed as an HTML entity, so
                # the time filter never applied.
                url = ('https://s.weibo.com/weibo?q=' + url_keyword
                       + '&typeall=1&suball=1&timescope=custom:' + window)
                print(url)
                bro.get(url)
                sleep(2)  # wait for the page to finish loading
                page_text = bro.page_source
                # --- parse ---
                tree = etree.HTML(page_text)
                wb_time = tree.xpath(".//*[@id='pl_feedlist_index']/div[2]/div[1]/div/div[1]/div[2]/p[2]/a[1]/text()")
                wb_name = tree.xpath(
                    ".//*[@id='pl_feedlist_index']/div[2]/div[2]/div/div[1]/div[2]/div[1]/div[2]/a[1]/text()")
                wb_text = tree.xpath(".//*[@id='pl_feedlist_index']/div[2]/div[2]/div/div[1]/div[2]/p[1]//text() ")
                wb_from = tree.xpath(".//*[@id='pl_feedlist_index']/div[2]/div[5]/div/div[1]/div[2]/p[3]/a[2]/text()")
                wb_href = tree.xpath(".//*[@id='pl_feedlist_index']/div[2]/div[1]/div/div[1]/div[2]/p[2]/a[1]/@href")
                if not (wb_name or wb_text or wb_time):
                    # Some hour windows contain no posts at all -- skip them
                    # instead of writing empty rows.
                    continue
                # --- store ---
                rb = xlrd.open_workbook('wb.xls')  # open existing workbook
                wb = copy(rb)  # xlutils.copy gives a writable copy
                ws = wb.get_sheet(0)  # sheet 0
                # xlwt cells accept scalars, not lists: join the xpath result
                # lists and strip the whitespace padding Weibo wraps them in.
                ws.write(x, 1, ' '.join(s.strip() for s in wb_name).strip())
                print(wb_name)
                ws.write(x, 2, ' '.join(s.strip() for s in wb_href).strip())
                print(wb_href)
                ws.write(x, 3, ' '.join(s.strip() for s in wb_text).strip())
                print(wb_text)
                ws.write(x, 4, ' '.join(s.strip() for s in wb_time).strip())
                print(wb_time)
                ws.write(x, 5, ' '.join(s.strip() for s in wb_from).strip())
                print(wb_from)
                x = x + 1
                print(x)
                wb.save('wb.xls')  # persist after every row
    finally:
        bro.quit()  # always release the browser
if __name__ == '__main__':
    # Row 0 of wb.xls is assumed to hold headers -- TODO confirm; start
    # writing scraped rows at row 1.
    p(days, 1)
有几个问题还没完善:
1. 使用 selenium 逐页抓取太慢(可以考虑多线程/多进程并发抓取);
2. 获取的文本和时间带有多余的空格(可以用 strip 或正则清理);
3. 某些时间段可能没有微博,会爬到空的结果(需要加一个判空再写入)。