Scraping Articles and Q&A from Zhihu (Permanently Valid)

Zhihu Search Crawler

Importing Libraries

  • Just run the code, see which modules are missing, and install them.
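If anything is missing, the third-party packages used below can be installed in one go (assuming the usual package names on PyPI):
pip install selenium pandas openpyxl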
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import re
import time
import pandas as pd
import openpyxl  # needed by pandas to read/write .xlsx files

Prerequisites

  • Google Chrome must be installed.
  • Start the browser from the command line with a remote-debugging port specified.
  • Here I use port 9222.
  • For the exact steps, just search online for how to have Selenium take over an already-open browser.

On Ubuntu, the steps are as follows:

  • First make sure Google Chrome is installed, then start it with the remote-debugging flag:
--remote-debugging-port=port
  • For example, to use port 9222:
google-chrome --remote-debugging-port=9222
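To confirm the debugging endpoint is actually listening before attaching, you can query Chrome's DevTools metadata endpoint (assuming the default 127.0.0.1 binding):
curl http://127.0.0.1:9222/json/version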
  • After that, the following code attaches to the running browser:
chrome_options = Options()
# Attach to the Chrome instance that is already listening on port 9222
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(options=chrome_options)
print(driver.title)  # print the title of the tab currently open in that browser
Sample output. The chromedriver message is a harmless warning (Selenium Manager could not reach the download service and fell back to a cached driver); the last line is the title of the currently open tab, here 必应 (Bing):

There was an error managing chromedriver (error sending request for url (https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-downloads.json)); using driver found in the cache
必应

The Crawler

  • The deal function takes an HTML element object (one search-result card), expands it, extracts its information, and collapses it again:
def deal(item):
    tmp={}
    # Scroll the card into view and click "read more" to expand the full text
    position=str(item.location['y']-item.size['height'])
    driver.execute_script("window.scrollTo(0,"+position+")")
    # Note: the hashed class names are generated by Zhihu's frontend build and may change over time
    more_button=item.find_element(by=By.CLASS_NAME,value='Button.ContentItem-more.FEfUrdfMIKpQDJDqkjte.Button--plain.fEPKGkUK5jyc4fUuT0QP')
    more_button.click()
    time.sleep(3)
    # Extract each field, falling back to parsing the card's raw text if a class is missing
    try:
        tmp['title']=item.find_element(by=By.CLASS_NAME,value='ContentItem-title').text
    except Exception:
        tmp['title']=str(item.text).split('\n')[0]
    try:
        tmp['author']=item.find_element(by=By.CLASS_NAME,value='UserLink.AuthorInfo-name').text
    except Exception:
        tmp['author']=None
    try:
        tmp['content']=re.sub(r'\s+','',str(item.find_element(by=By.CLASS_NAME,value='RichContent-inner').text))
    except Exception:
        tmp['content']=re.sub(r'\s+','',str(''.join(str(item.text).split('\n')[1:])))
    try:
        # Strip the "发布于" (published at) / "编辑于" (edited at) prefixes from the timestamp
        tmp['time']=str(item.find_element(by=By.CLASS_NAME,value='ContentItem-time').text).replace('发布于','')
        tmp['time']=tmp['time'].replace('编辑于','')
    except Exception:
        tmp['time']='2024-01-01 00:00'  # placeholder when no timestamp is shown
    try:
        tmp['up_count']=int(re.search(r'\d+',str(item.find_element(by=By.CLASS_NAME,value='Button.VoteButton.VoteButton--up.FEfUrdfMIKpQDJDqkjte').text)).group())
    except Exception:
        tmp['up_count']=0
    # Collapse the card again so the coordinates of later cards stay usable
    down_position=str(item.location['y'])
    driver.execute_script("window.scrollTo(0,"+down_position+")")
    time.sleep(1)
    less_button=item.find_element(by=By.CLASS_NAME,value='RichContent-collapsedText')
    less_button.click()
    return tmp
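To sanity-check deal on its own, with a Zhihu search-results page already open in the attached browser, you can run something like the following (a quick sketch, not part of the pipeline below):
re_list=driver.find_elements(by=By.CLASS_NAME,value="List-item")  # all result cards on the page
if re_list:
    print(deal(re_list[0]))  # parse the first card and show the extracted fields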

Pipeline

  • Pass in a string: the query to type into the search box.
  • Running this function saves everything found for the search directly into the spreadsheet file.
def deal_search(content:str):
    # Load what has been scraped so far; start with an empty table on the first run
    try:
        df_ori=pd.read_excel('result.xlsx')
        df_ori=df_ori[df_ori.columns[1:]]  # drop the index column that to_excel saved
    except FileNotFoundError:
        df_ori=pd.DataFrame()
    articles=[]
    # Absolute XPaths to Zhihu's search box and button; these break if the page layout changes
    search_input=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/input")
    search_button=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/button")
    search_input.send_keys(content)
    search_button.click()
    time.sleep(5)
    root=driver.find_element(by=By.ID,value='root')
    root.click()
    # Scroll to the bottom repeatedly to trigger lazy loading of more results
    for i in range(25):
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(0.5)
    time.sleep(3)
    re_list=driver.find_elements(by=By.CLASS_NAME,value="List-item")
    for item in re_list:
        try:
            articles.append(deal(item))
        except Exception:
            continue  # skip cards that fail to parse
    # Append the new rows to the previous results and write everything back
    df=pd.DataFrame(articles)
    df=pd.concat([df_ori,df],axis=0)
    df.to_excel('result.xlsx')
    # Clear the search box so the next query starts from an empty field
    search_input=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/input")
    for i in range(len(content)):
        search_input.send_keys(Keys.BACK_SPACE)
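For example, a single run (using the first keyword from the list in the next section):
deal_search('乡村振兴')  # scrapes one query and appends the rows to result.xlsx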

Keyword List

  • Modify the keyword list as needed.
contents=['乡村振兴','大学生返乡发展','青年返乡发展','乡村复兴','发展农村','农村创业','乡村创业','如何发展乡村经济','农村经济发展','山村经济发展','青年返乡','毕业生农村','青年乡村','大学生乡村']

Scrape the content for each keyword in the list

ind=0
while ind<len(contents):
    try:
        deal_search(contents[ind])
        ind+=1  # advance only after the keyword succeeds
    except Exception:
        # On failure, refresh, clear the search box, and retry the same keyword
        print(contents[ind]+" failed")
        driver.refresh()
        time.sleep(10)
        search_input=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/input")
        for i in range(10):
            search_input.send_keys(Keys.BACK_SPACE)
    time.sleep(5)

Based on whether the title contains a question mark, split the rows into questions and articles and save them to answers.xlsx and articles.xlsx respectively

df=pd.read_excel('./result.xlsx')
df=df[df.columns[1:]]  # drop the saved index column
# Drop duplicate rows, keeping the first occurrence of each
bools=df.duplicated(subset=None, keep='first')
df_unique=df[~bools].reset_index(drop=True)
# A title containing a question mark (full- or half-width) marks the row as a question
question=[("?" in str(i)) or ("?" in str(i)) for i in df_unique['title']]
df_articles=df_unique[[not i for i in question]].reset_index(drop=True)
df_answers=df_unique[question].reset_index(drop=True)
df_answers.to_excel("./answers.xlsx")
df_articles.to_excel("./articles.xlsx")
df_unique.to_excel('result_unique.xlsx')
print(bools)
0       False
1       False
2       False
3       False
4       False
        ...  
2598    False
2599     True
2600    False
2601    False
2602    False
Length: 2603, dtype: bool