Scraping Articles and Q&A from Zhihu (Permanently Valid)

Zhihu Search Crawler

Importing Libraries

  • Just run the code, see which modules are missing, and install them.
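If anything is missing, the third-party packages used below can be installed in one go (assuming the usual package names on PyPI):
pip install selenium pandas openpyxl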
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import re
import time
import pandas as pd
import openpyxl  # needed by pandas to read/write .xlsx files

Prerequisites

  • Google Chrome must be installed.
  • Start the browser from the command line with a remote-debugging port specified.
  • Here I use port 9222.
  • For the exact steps, just search online for how to have Selenium take over an already-open browser.

On Ubuntu, the steps are as follows:

  • First make sure Google Chrome is installed, then start it with the remote-debugging flag:
--remote-debugging-port=port
  • For example, to use port 9222:
google-chrome --remote-debugging-port=9222
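To confirm the debugging endpoint is actually listening before attaching, you can query Chrome's DevTools metadata endpoint (assuming the default 127.0.0.1 binding):
curl http://127.0.0.1:9222/json/version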
  • After that, the following code attaches to the running browser:
chrome_options = Options()
# Attach to the Chrome instance that is already listening on port 9222
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
driver = webdriver.Chrome(options=chrome_options)
print(driver.title)  # print the title of the tab currently open in that browser
Sample output. The chromedriver message is a harmless warning (Selenium Manager could not reach the download service and fell back to a cached driver); the last line is the title of the currently open tab, here 必应 (Bing):

There was an error managing chromedriver (error sending request for url (https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-downloads.json)); using driver found in the cache
必应

The Crawler

  • The deal function takes an HTML element object (one search-result card), expands it, extracts its information, and collapses it again:
def deal(item):
    tmp={}
    # Scroll the card into view and click "read more" to expand the full text
    position=str(item.location['y']-item.size['height'])
    driver.execute_script("window.scrollTo(0,"+position+")")
    # Note: the hashed class names are generated by Zhihu's frontend build and may change over time
    more_button=item.find_element(by=By.CLASS_NAME,value='Button.ContentItem-more.FEfUrdfMIKpQDJDqkjte.Button--plain.fEPKGkUK5jyc4fUuT0QP')
    more_button.click()
    time.sleep(3)
    # Extract each field, falling back to parsing the card's raw text if a class is missing
    try:
        tmp['title']=item.find_element(by=By.CLASS_NAME,value='ContentItem-title').text
    except Exception:
        tmp['title']=str(item.text).split('\n')[0]
    try:
        tmp['author']=item.find_element(by=By.CLASS_NAME,value='UserLink.AuthorInfo-name').text
    except Exception:
        tmp['author']=None
    try:
        tmp['content']=re.sub(r'\s+','',str(item.find_element(by=By.CLASS_NAME,value='RichContent-inner').text))
    except Exception:
        tmp['content']=re.sub(r'\s+','',str(''.join(str(item.text).split('\n')[1:])))
    try:
        # Strip the "发布于" (published at) / "编辑于" (edited at) prefixes from the timestamp
        tmp['time']=str(item.find_element(by=By.CLASS_NAME,value='ContentItem-time').text).replace('发布于','')
        tmp['time']=tmp['time'].replace('编辑于','')
    except Exception:
        tmp['time']='2024-01-01 00:00'  # placeholder when no timestamp is shown
    try:
        tmp['up_count']=int(re.search(r'\d+',str(item.find_element(by=By.CLASS_NAME,value='Button.VoteButton.VoteButton--up.FEfUrdfMIKpQDJDqkjte').text)).group())
    except Exception:
        tmp['up_count']=0
    # Collapse the card again so the coordinates of later cards stay usable
    down_position=str(item.location['y'])
    driver.execute_script("window.scrollTo(0,"+down_position+")")
    time.sleep(1)
    less_button=item.find_element(by=By.CLASS_NAME,value='RichContent-collapsedText')
    less_button.click()
    return tmp
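To sanity-check deal on its own, with a Zhihu search-results page already open in the attached browser, you can run something like the following (a quick sketch, not part of the pipeline below):
re_list=driver.find_elements(by=By.CLASS_NAME,value="List-item")  # all result cards on the page
if re_list:
    print(deal(re_list[0]))  # parse the first card and show the extracted fields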

Pipeline

  • Pass in a string: the query to type into the search box.
  • Running this function saves everything found for the search directly into the spreadsheet file.
def deal_search(content:str):
    # Load what has been scraped so far; start with an empty table on the first run
    try:
        df_ori=pd.read_excel('result.xlsx')
        df_ori=df_ori[df_ori.columns[1:]]  # drop the index column that to_excel saved
    except FileNotFoundError:
        df_ori=pd.DataFrame()
    articles=[]
    # Absolute XPaths to Zhihu's search box and button; these break if the page layout changes
    search_input=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/input")
    search_button=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/button")
    search_input.send_keys(content)
    search_button.click()
    time.sleep(5)
    root=driver.find_element(by=By.ID,value='root')
    root.click()
    # Scroll to the bottom repeatedly to trigger lazy loading of more results
    for i in range(25):
        driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
        time.sleep(0.5)
    time.sleep(3)
    re_list=driver.find_elements(by=By.CLASS_NAME,value="List-item")
    for item in re_list:
        try:
            articles.append(deal(item))
        except Exception:
            continue  # skip cards that fail to parse
    # Append the new rows to the previous results and write everything back
    df=pd.DataFrame(articles)
    df=pd.concat([df_ori,df],axis=0)
    df.to_excel('result.xlsx')
    # Clear the search box so the next query starts from an empty field
    search_input=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/input")
    for i in range(len(content)):
        search_input.send_keys(Keys.BACK_SPACE)
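For example, a single run (using the first keyword from the list in the next section):
deal_search('乡村振兴')  # scrapes one query and appends the rows to result.xlsx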

Keyword List

  • Modify the keyword list as needed.
contents=['乡村振兴','大学生返乡发展','青年返乡发展','乡村复兴','发展农村','农村创业','乡村创业','如何发展乡村经济','农村经济发展','山村经济发展','青年返乡','毕业生农村','青年乡村','大学生乡村']

Scrape the content for each keyword in the list

ind=0
while ind<len(contents):
    try:
        deal_search(contents[ind])
        ind+=1  # advance only after the keyword succeeds
    except Exception:
        # On failure, refresh, clear the search box, and retry the same keyword
        print(contents[ind]+" failed")
        driver.refresh()
        time.sleep(10)
        search_input=driver.find_element(by=By.XPATH,value="/html/body/div[1]/div/div[2]/header/div[1]/div[1]/div/form/div/div/label/input")
        for i in range(10):
            search_input.send_keys(Keys.BACK_SPACE)
    time.sleep(5)

Based on whether the title contains a question mark, split the rows into questions and articles and save them to answers.xlsx and articles.xlsx respectively

df=pd.read_excel('./result.xlsx')
df=df[df.columns[1:]]  # drop the saved index column
# Drop duplicate rows, keeping the first occurrence of each
bools=df.duplicated(subset=None, keep='first')
df_unique=df[~bools].reset_index(drop=True)
# A title containing a question mark (full- or half-width) marks the row as a question
question=[("?" in str(i)) or ("?" in str(i)) for i in df_unique['title']]
df_articles=df_unique[[not i for i in question]].reset_index(drop=True)
df_answers=df_unique[question].reset_index(drop=True)
df_answers.to_excel("./answers.xlsx")
df_articles.to_excel("./articles.xlsx")
df_unique.to_excel('result_unique.xlsx')
print(bools)
0       False
1       False
2       False
3       False
4       False
        ...  
2598    False
2599     True
2600    False
2601    False
2602    False
Length: 2603, dtype: bool