Weibo Crawler, Part 1 (Selenium)

徐尚

Last modified 2023-12-06 16:21:46

First published 2020-05-14 21:40:45

Article link: https://blog.csdn.net/weixin_45042620/article/details/106126901

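Background

Companies doing public-opinion analysis can hardly ignore Sina Weibo: they watch it for negative news so they can step in early. Finding and filtering these posts by hand is slow and does not scale;
Since graduating, I have rarely kept up with news from my alma mater. Learning to write crawlers is a good excuse to see what has been happening there recently.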

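Goal

Crawl the recent posts of my alma mater's Weibo account, 【南京师范大学】, together with their interaction data.
The data to collect:

Publication time of each post;
Text content of the post;
Image information (URL);
Number of reposts;
Number of comments, plus commenter IDs and comment text;
Number of likes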

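Exploration

Scrolling down shows that the posts are loaded dynamically via Ajax. With a little experimenting, scrolling to the bottom 3 times turns out to be enough to load the whole page;
Clicking "Next page" at the bottom of the page triggers a login prompt asking for a Weibo account and password;
Clicking the comment button reveals the top commenters and their comments under each post.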

Implementation

Step 1: Import packages, define constants, initialize Selenium

from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import etree
import pandas as pd

url='https://weibo.com/njnusun?'   # entry page
USER_NAME='xxxxxxxx'               # Weibo account
PASSWORD='xxxxx'                   # Weibo password
# Selenium 4 takes the chromedriver path via a Service object
# (Selenium 3 accepted the path as the first positional argument)
driver = webdriver.Chrome(service=Service(r"C:\Users\ThinkPad\AppData\Local\Google\Chrome\Application\chromedriver.exe"))

This defines the target URL and the Weibo username and password (replace them with your own account details) and launches Chrome.
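Since the account and password have to be swapped in by hand, one option is to keep them out of the script entirely. Below is a minimal sketch that reads them from environment variables. The variable names WEIBO_USER and WEIBO_PASS and the helper load_credentials are my own assumptions for illustration, not part of the original script.

```python
import os


def load_credentials(env=os.environ):
    """Return (user_name, password) read from a mapping of environment variables.

    WEIBO_USER and WEIBO_PASS are hypothetical names chosen for this sketch.
    """
    try:
        return env["WEIBO_USER"], env["WEIBO_PASS"]
    except KeyError as missing:
        # Fail early with a readable message instead of logging in with empty values
        raise RuntimeError(
            "Set the {} environment variable first".format(missing.args[0])
        ) from None
```

With this in place, USER_NAME, PASSWORD = load_credentials() could replace the two hard-coded constants.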

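Step 2: Open the target page and resize the window

driver.get(url)
time.sleep(2)
driver.maximize_window()   # maximize the window

Selenium is now driving the browser on the target page, but only as a guest: it is not logged in yet.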

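Step 3: Simulate login

Use XPath to locate the login link in the top-right corner and click it;
Locate the account input box and type USER_NAME;
Locate the password input box and type PASSWORD;
Locate the login button and click it

The code is as follows: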

# Explicit wait: until the login link in the top-right corner is clickable
wait=WebDriverWait(driver,10)
wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@node-type="loginBtn"]')))
# Locate the login link and click it
driver.find_element(By.XPATH,'//a[@node-type="loginBtn"]').click()

# Explicit wait: until the login dialog has appeared and its submit button is clickable
wait=WebDriverWait(driver,10)
wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@node-type="submitBtn"]')))

# Type the Weibo account and password (the constants defined in Step 1)
driver.find_element(By.XPATH,'//input[@node-type="username"]').send_keys(USER_NAME)
time.sleep(2)
driver.find_element(By.XPATH,'//input[@node-type="password"]').send_keys(PASSWORD)
time.sleep(2)
driver.find_element(By.XPATH,'//a[@node-type="submitBtn"]').click()
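To make the idea behind the explicit waits above concrete, here is a rough, browser-free sketch of the polling loop such a wait performs: call a condition repeatedly until it returns something truthy or a timeout expires. The helper wait_until is hypothetical and only mimics the general idea; it is not Selenium's implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            # Hand back whatever the condition produced (e.g. the located element)
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)
```

Compared with a fixed time.sleep, a loop like this returns as soon as the page is ready and fails loudly when it never is.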

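Step 4: Handle the Ajax loading by executing JavaScript; after scrolling to the bottom 3 times the page is fully loaded

# Move the scroll bar to the bottom of the page (repeat 3 times)
js="document.documentElement.scrollTop=100000"
for i in range(3):
    driver.execute_script(js)
    time.sleep(3)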
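The fixed count of 3 comes from observing this particular page. A more general variant, sketched below, keeps scrolling until the page height stops growing. The function scroll_until_loaded and its parameters are my own illustration; it only assumes the driver object has an execute_script method, as Selenium drivers do.

```python
import time


def scroll_until_loaded(driver, max_rounds=10, pause=3.0):
    """Scroll to the bottom until the page height stops changing.

    Returns the number of scrolls performed (at most `max_rounds`).
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(pause)  # give the Ajax request time to append new posts
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            return rounds  # nothing new was loaded, so we are at the real bottom
        last_height = new_height
    return max_rounds
```

The max_rounds cap guards against pages that keep loading content indefinitely.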

Step 5: Extract the date, text, image URL, repost count, comment count, and like count

Taking the first post as an example:

date=driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div/div[@class="WB_detail"]/div[2]/a').get_attribute('title')
text=driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div/div[@class="WB_detail"]/div[4]').text.strip()
image=driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div/div[@class="WB_detail"]/div[6]//img[1]').get_attribute('src').strip()
forward=driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div[2]//li[2]//em[2]').text
comment=driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div[2]//li[3]//em[2]').text
like=driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div[2]//li[4]//em[2]').text
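The forward, comment, and like values above are strings taken straight from the page. For analysis they are more useful as integers. The sketch below assumes (I have not verified this against the live page) that the counter shows a plain label such as 转发 when the count is zero and may abbreviate large numbers with 万 or 亿. The helper parse_count is hypothetical.

```python
import re


def parse_count(text):
    """Convert a counter string such as '12' or '1.2万' to an int; non-numeric labels become 0."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(万|亿)?", text.strip())
    if not match:
        # Assumption: a label like '转发' or an empty string means no interactions yet
        return 0
    number, unit = match.groups()
    scale = {None: 1, "万": 10_000, "亿": 100_000_000}[unit]
    return int(round(float(number) * scale))
```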


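Step 6: Extract the commenter IDs and comment text for each post

1. To get the comments, you must first click the comment button;
2. The comment button can only be clicked when it is visible in the current viewport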

# Scroll the post's image into view so that the comment button below it is visible and clickable
comment_button = driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div/div[@class="WB_detail"]/div[6]//img')
driver.execute_script("arguments[0].scrollIntoView();", comment_button)
time.sleep(3)
# Click the comment counter to expand the comments
driver.find_element(By.XPATH,'//div[@action-data="cur_visible=0"][1]/div[2]//li[3]//em[2]').click()
time.sleep(3)

# Grab the page source and parse out the commenter IDs
page_source=driver.page_source
html=etree.HTML(page_source)
comment_id=html.xpath('//div[@action-data="cur_visible=0"][1]//div[@node-type="replywrap"][1]/div[@class="WB_text"][1]/a[1]/@usercard')

# Parse out each comment's text and join it with the commenter ID, one comment per line
user_comment=''
for j in range(len(comment_id)):
    comments=''.join(html.xpath('//div[@action-data="cur_visible=0"][1]//div[{}]/div[@node-type="replywrap"][1]/div[@class="WB_text"][1]/text()'.format(j+1))).strip()
    user_comment+=comment_id[j]+comments+'\n'
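Concatenating IDs and comments into one string is convenient for printing but awkward to analyze later. As an alternative, the sketch below pairs them into a list of dictionaries. It assumes the usercard attribute is a query-string-like value such as id=123456 and that the comment text begins with a colon left over from the "name：comment" layout. Both are assumptions about the page markup, and build_comments is a hypothetical helper.

```python
from urllib.parse import parse_qs


def build_comments(usercards, texts):
    """Pair usercard attributes with comment texts as [{'user_id': ..., 'text': ...}, ...]."""
    comments = []
    for usercard, text in zip(usercards, texts):
        # Assumption: usercard looks like 'id=123456'; fall back to the raw value otherwise
        user_id = parse_qs(usercard).get("id", [usercard])[0]
        # Assumption: the text may start with the colon that followed the commenter's name
        cleaned = text.strip().lstrip("：:").strip()
        comments.append({"user_id": user_id, "text": cleaned})
    return comments
```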

Packaged code

from selenium.webdriver.common.action_chains import ActionChains
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import etree
import pandas as pd

def open_url(url):
    driver.get(url)
    time.sleep(2)
    driver.maximize_window()  # maximize the window

def login(user_name,password):
    # Explicit wait: until the login link in the top-right corner is clickable
    wait=WebDriverWait(driver,10)
    wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@node-type="loginBtn"]')))

    # Locate the login link and click it
    driver.find_element(By.XPATH,'//a[@node-type="loginBtn"]').click()

    # Explicit wait: until the login dialog has appeared and its submit button is clickable
    wait=WebDriverWait(driver,10)
    wait.until(EC.element_to_be_clickable((By.XPATH,'//a[@node-type="submitBtn"]')))

    # Type the Weibo account and password
    driver.find_element(By.XPATH,'//input[@node-type="username"]').send_keys(user_name)
    time.sleep(2)
    driver.find_element(By.XPATH,'//input[@node-type="password"]').send_keys(password)
    time.sleep(2)
    driver.find_element(By.XPATH,'//a[@node-type="submitBtn"]').click()

# Move the scroll bar to the bottom of the page, `times` times
def tobottom(times):
    js="document.documentElement.scrollTop=100000"
    for i in range(times):
        driver.execute_script(js)
        time.sleep(3)

# Move the scroll bar back to the top of the page
def totop():
    js="document.documentElement.scrollTop=0"
    driver.execute_script(js)
    time.sleep(3)

# Parse every post on the current page
def parse_item():
    page_source0=driver.page_source
    html0=etree.HTML(page_source0)
    items=html0.xpath('//div[@action-data="cur_visible=0"]')
    for i in range(1,len(items)+1):
        result={}
        # XPath prefix selecting the i-th post
        post='//div[@action-data="cur_visible=0"][{}]'.format(i)
        date=driver.find_element(By.XPATH,post+'/div/div[@class="WB_detail"]/div[2]/a').get_attribute('title')
        text=driver.find_element(By.XPATH,post+'/div/div[@class="WB_detail"]/div[4]').text.strip()

        # Not every post has an image or all three counters, so fall back to ''
        try:
            image=driver.find_element(By.XPATH,post+'/div/div[@class="WB_detail"]/div[6]//img[1]').get_attribute('src').strip()
        except Exception:
            image=''

        try:
            forward=driver.find_element(By.XPATH,post+'/div[2]//li[2]//em[2]').text
        except Exception:
            forward=''

        try:
            comment=driver.find_element(By.XPATH,post+'/div[2]//li[3]//em[2]').text
        except Exception:
            comment=''

        try:
            like=driver.find_element(By.XPATH,post+'/div[2]//li[4]//em[2]').text
        except Exception:
            like=''

        # Locate the comments
        try:
            # Scroll the post's image into view so that the comment button is visible and clickable
            comment_button = driver.find_element(By.XPATH,post+'/div/div[@class="WB_detail"]/div[6]//img')
            driver.execute_script("arguments[0].scrollIntoView();", comment_button)
            time.sleep(3)

            # Click the comment counter to expand the comments
            driver.find_element(By.XPATH,post+'/div[2]//li[3]//em[2]').click()
            time.sleep(3)

            # Grab the page source and parse out the commenter IDs
            page_source=driver.page_source
            html=etree.HTML(page_source)
            comment_id=html.xpath(post+'//div[@node-type="replywrap"][1]/div[@class="WB_text"][1]/a[1]/@usercard')

            # Parse out each comment's text and join it with the commenter ID
            user_comment=''
            for j in range(len(comment_id)):
                comments=''.join(html.xpath(post+'//div[{}]/div[@node-type="replywrap"][1]/div[@class="WB_text"][1]/text()'.format(j+1))).strip()
                user_comment+=comment_id[j]+comments+'\n'
        except Exception:
            user_comment=''

        # Yield one record per post (parse_item is a generator)
        result['date']=date
        result['text']=text
        result['image']=image
        result['forward']=forward
        result['comment']=comment
        result['like']=like   # the original computed this value but never stored it
        result['user_comment']=user_comment
        yield result

# Go to the next page
def next_page():
    button=driver.find_element(By.XPATH,'//a[@class="page next S_txt1 S_line1"]')
    button.click()

if __name__ == '__main__':
    url='https://weibo.com/njnusun?'  # entry page
    USER_NAME='xxxxx'         # Weibo account
    PASSWORD='xxxxx'          # Weibo password
    # Selenium 4 takes the chromedriver path via a Service object
    driver = webdriver.Chrome(service=Service(r"C:\Users\ThinkPad\AppData\Local\Google\Chrome\Application\chromedriver.exe"))

    # Number of pages to crawl
    PAGES=3
    results=[]
    open_url(url)
    time.sleep(5)
    login(USER_NAME,PASSWORD)
    for i in range(PAGES):
        time.sleep(5)
        tobottom(3)
        time.sleep(3)
        totop()
        for item in parse_item():
            results.append(item)
            print(item)
        if i<PAGES-1:   # no need to turn the page after the last one
            next_page()
            time.sleep(3)
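The script collects every record in results but only prints them. One way to persist them is a CSV file. The sketch below uses the standard-library csv module rather than pandas so that it runs without extra dependencies. The function write_results and the column order in FIELDS are my own choices; the keys mirror the result dictionary built in parse_item.

```python
import csv
import io

# Column order for the output file; keys missing from a record are written as ''
FIELDS = ["date", "text", "image", "forward", "comment", "like", "user_comment"]


def write_results(results, stream):
    """Write an iterable of result dicts to a text stream as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)


# Small in-memory demonstration; a real run would pass an open file instead
buffer = io.StringIO()
write_results([{"date": "2020-05-14 21:40", "text": "hello", "forward": "3"}], buffer)
print(buffer.getvalue())
```

For a real run, something like open("weibo.csv", "w", newline="", encoding="utf-8-sig") can be passed as the stream; the utf-8-sig encoding is a common choice so that Excel displays Chinese text correctly.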
