Scraping a Zhihu Collection with Python: A Beginner's Crawler

Latest recommended article published 2022-12-10 17:33:15

weixin_42128015


Tags: python, Zhihu crawler, collection

Article link: https://blog.csdn.net/weixin_42128015/article/details/113985803


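This article shows how to use Python and BeautifulSoup to scrape data on popular Zhihu questions, including the question, the answer author's details, the upvote count, and the comment count, and how to store that data in a MongoDB database. Beginners can use this example to learn basic scraping techniques and database operations.

Summary auto-generated by CSDN.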

Introduction

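Zhihu is a fairly easy site to scrape. It has no complicated anti-scraping measures, which makes it a good practice target for people who are new to crawlers.

Since I have only just started learning Python, for now I simply save the main information about popular Zhihu questions to a database, so that it can be used for data analysis later. The pages scraped belong to a collection of answers with more than 1000 upvotes.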

Page analysis

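1. Analyze the page structure

The goal is to extract, for each popular question, the question itself, the answer's author, the upvote count, the comment count, and similar fields.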

(Figure: page layout analysis)
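One way to pin down the target fields before writing any scraping code is a small record type. The sketch below is an assumption rather than part of the original script: `AnswerRecord` is a hypothetical name, the field names mirror the dictionary keys used in `store_answer` later in this article, the sample values are made up, and every value is kept as a string because it comes straight from page text.

```python
from dataclasses import dataclass, asdict


@dataclass
class AnswerRecord:
    title: str              # question title
    like_num: str           # upvote count as shown on the page
    answer_user_name: str   # answer author's name
    answer_user_sign: str   # answer author's bio line
    answer: str             # answer summary
    time: str               # edit time as shown on the page
    comment: str            # comment count as shown on the page
    link: str               # link to the answer


# Made-up sample values, only to show the shape of one record.
record = AnswerRecord(
    title="Sample question?",
    like_num="1024",
    answer_user_name="someone",
    answer_user_sign="a short bio",
    answer="A short summary of the answer.",
    time="2017-01-01",
    comment="12",
    link="/question/1/answer/2",
)

# asdict() turns the record into a plain dict, the shape a document store expects.
document = asdict(record)
print(document["title"], document["like_num"])
```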

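2. Analyze the page elements

Inspect the elements that hold the content to be scraped and note their distinguishing features (class, id, and so on). These are used later to extract the content with BeautifulSoup.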

(Figure: HTML analysis)
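As a rough illustration of this step, the sketch below runs BeautifulSoup over a tiny hand-written HTML fragment. The fragment is invented for the example and only borrows class names that appear in this article's code (`zm-item`, `zm-item-title`, `zm-item-vote`); the real Zhihu markup may differ. The built-in `html.parser` is used here only to avoid the extra `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not real Zhihu HTML).
SAMPLE_HTML = """
<div class="zm-item">
  <h2 class="zm-item-title">Sample question?</h2>
  <div class="zm-item-vote">1024</div>
</div>
<div class="zm-item">
  <h2 class="zm-item-title">Another question?</h2>
  <div class="zm-item-vote">2048</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching tag; find returns the first match or None.
items = soup.find_all("div", class_="zm-item")
titles = [item.find("h2", class_="zm-item-title").text.strip() for item in items]
votes = [item.find("div", class_="zm-item-vote").text.strip() for item in items]

print(titles)
print(votes)
```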

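3. Parse the fetched pages with BeautifulSoup

The page number in these URLs increases sequentially, so the link for each page can be built by string concatenation.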

url_part = "https://www.zhihu.com/collection/19928423?page="  # collection of answers with over 1000 upvotes

url = url_part + str(i)  # build the URL to scrape; i is the page number
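The concatenation above can be wrapped in a small helper that produces the URLs for a range of pages. This is only a sketch: `build_page_urls` is a hypothetical name, and the page range 1 to 3 is arbitrary.

```python
URL_PART = "https://www.zhihu.com/collection/19928423?page="


def build_page_urls(first_page, last_page):
    """Return the collection URLs for pages first_page..last_page, inclusive."""
    return [URL_PART + str(i) for i in range(first_page, last_page + 1)]


for url in build_page_urls(1, 3):
    print(url)
```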

The parsing code that uses BeautifulSoup:

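def find_answers(url, collection):
    get_html = requests.get(url, headers=Web.headers)  # fetch the page with requests
    soup = BeautifulSoup(get_html.text, 'lxml')  # parse the page with BeautifulSoup
    items = soup.find_all('div', class_="zm-item")  # all popular-question entries on the page
    success = 0
    error = 0
    for item in items:
        try:
            data = store_answer(item)
            collection.insert_one(data)  # insert into the collection (insert() was removed in PyMongo 4)
        except AttributeError:
            error += 1  # an expected element was missing from this entry
        else:
            success += 1
    return success, error  # number of entries stored and skipped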
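The `try`/`except`/`else` bookkeeping in `find_answers` can be tried out without touching the network or a database. In this sketch `process_items`, `parse`, and `store` are hypothetical stand-ins for the scraping and insertion steps. The `else` branch runs only when the `try` body raised nothing, which is why it is the right place to count successes.

```python
def process_items(items, parse, store):
    """Parse and store each item, counting successes and AttributeErrors."""
    success = 0
    error = 0
    for item in items:
        try:
            store(parse(item))
        except AttributeError:
            error += 1  # e.g. an expected element was missing
        else:
            success += 1  # reached only if the try body raised nothing
    return success, error


def parse(item):
    # Stand-in parser: None mimics a failed find(), so .strip() raises AttributeError.
    return {"title": item.strip()}


stored = []
counts = process_items([" first ", None, " third "], parse, stored.append)
print(counts)  # (2, 1)
print(stored)
```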

def store_answer(answer):
    data = {
        "title": answer.find("h2", class_="zm-item-title").text,  # question title
        "like_num": answer.find("div", class_="zm-item-vote").text,  # upvote count
        "answer_user_name": answer.find("div", class_="answer-head").find("span", class_="author-link-line").text,  # answer author's name
        "answer_user_sign": answer.find("div", class_="answer-head").find("span", class_="bio").text,  # answer author's bio
        "answer": answer.find("div", class_="zh-summary summary clearfix").text,  # answer summary
        "time": answer.find("p", class_="visible-expanded").find("a", class_="answer-date-link meta-item").text,  # edit time
        "comment": answer.find("div", class_="zm-meta-panel").find("a",
                   class_="meta-item toggle-comment js-toggleCommentBox").text,  # comment count
        "link": answer.find("link").get("href")  # link to the answer
    }
    return data
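`find()` returns `None` when nothing matches, so a single missing element makes `store_answer` raise `AttributeError`, and the whole entry is skipped. One possible alternative, sketched here, is a small helper that falls back to a default value instead. `text_or_default` is a hypothetical name, and the HTML fragment is invented for the example.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_name, default=None):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default


# Hand-written fragment with no "bio" element (an assumption, not real Zhihu HTML).
html = '<div class="zm-item"><h2 class="zm-item-title"> A question </h2></div>'
item = BeautifulSoup(html, "html.parser").find("div", class_="zm-item")

title = text_or_default(item, "h2", "zm-item-title")
sign = text_or_default(item, "span", "bio", default="")
print(repr(title), repr(sign))
```

Whether to skip an incomplete entry or store it with blank fields depends on how the data will be analyzed later; the original code takes the first approach.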

4. Full code

import time  # used to time the program
import requests  # used to fetch pages
from bs4 import BeautifulSoup  # used to extract data from pages
from pymongo import MongoClient  # used to connect to the database


class Web:
    headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/"
                             "56.0.2924.87 Safari/537.36"}  # request headers
    url_part = "https://www.zhihu.com/collection/19928423?page="  # collection of answers with over 1000 upvotes

def get_collection():
    client = MongoClient('mongodb://localhost:27017/')  # connect to MongoDB
    db = client.data  # open the database "data" (the name can be changed)
    collection = db.zhihu  # open the collection "zhihu" (the name can be changed)
    return collection