Before running the script, install the required libraries: requests, BeautifulSoup, and Selenium (which provides webdriver).
pip install requests
pip install beautifulsoup4
pip install selenium
Zhihu loads answers only as the user scrolls down the page, so the script uses webdriver to create a browser instance and simulate that scrolling:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.by import By


def get_answers(url):
    # Use Firefox as the browser
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Scroll to the bottom 10 times so that more answers load
        for _ in range(10):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # wait for the newly loaded answers to render
        # Parse the rendered page and collect the body of each answer
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        answers = soup.find_all('div', class_='RichContent-inner')
        # Print each answer and append it to the output file
        with open("D:\\cc\\1.txt", "a", encoding="utf-8") as f:
            for answer in answers:
                print(answer.get_text() + '\n')
                f.write(answer.get_text() + '\n\n')
    finally:
        # Close the browser even if something above fails
        driver.quit()


if __name__ == '__main__':
    url = 'https://www.zhihu.com/question/346740353'
    # url = input("Enter the Zhihu URL to scrape: ")
    get_answers(url)
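A fixed number of scrolls can stop too early on a long question or waste time on a short one. One possible alternative, shown here only as a sketch, is to keep scrolling until the page height stops growing. The helper name scroll_until_stable and its pause and max_rounds parameters are made up for this example and are not part of Selenium; it assumes only that driver has Selenium's execute_script method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=30):
    """Scroll to the bottom until the page height stops growing.

    `driver` is any object with Selenium's execute_script method.
    Returns the number of scrolls performed (hypothetical helper).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more answers
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so stop here
        last_height = new_height
    return max_rounds  # safety cap reached
```

Inside get_answers, a call such as scroll_until_stable(driver) could take the place of the for loop. The max_rounds cap is there so that a page that keeps growing cannot loop forever.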
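The scraped answers are appended to D:\cc\1.txt (the D:\cc folder must already exist); change url in the script to scrape a different question.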
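Because every run appends to the same file, answers from different questions end up mixed together once you change the url. A small sketch of one way around this is to name the output file after the question ID in the URL. The function output_path_for and the "answers" folder are invented for this example, and it assumes the URL contains /question/<digits> as in the one above.

```python
import re
from pathlib import Path


def output_path_for(url, out_dir="."):
    """Return out_dir/<question id>.txt for a Zhihu question URL (hypothetical helper)."""
    match = re.search(r"/question/(\d+)", url)
    if match is None:
        raise ValueError(f"no question id found in {url!r}")
    return Path(out_dir) / f"{match.group(1)}.txt"


# Example: a file named 346740353.txt inside an "answers" folder
print(output_path_for("https://www.zhihu.com/question/346740353", "answers"))
```

The returned path could then be passed to open() in place of the hard-coded "D:\\cc\\1.txt".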
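P.S. For learning purposes only. Because of Zhihu's anti-scraping measures, you may need to log in before the answers are accessible.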