Libraries used
- requests
- xpath parsing library (lxml)
- multiprocessing (multiple processes)
- pymysql (database access library)
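Background
The goal is to scrape the questions on the Zhihu hot list together with their most upvoted answers. Pages are fetched with the requests library, parsed with xpath, and the results are stored in a MySQL database.
The URL being scraped is: https://www.zhihu.com/hot
The source code is on my GitHub: 知乎热榜问题及答案数据获取
This article was first published on my personal website: 大圣的专属空间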
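To make the parsing step concrete before going through the real code, here is a minimal sketch that applies the same xpath expressions to a small hand-written HTML string instead of the live page. The sample markup, the function name `parse_hot_items`, and the choice to take the last path segment of the href as the question ID are all assumptions made for illustration. The real page layout may differ or change over time.

```python
from lxml import etree

# Hypothetical, hand-written markup that imitates the hot-list structure.
# The class names are the ones queried in this article; everything else is made up.
SAMPLE_HTML = """
<div>
  <section class="HotItem">
    <div class="HotItem-index"><div class="HotItem-rank">1</div></div>
    <div class="HotItem-content">
      <a href="https://www.zhihu.com/question/100001">
        <h2 class="HotItem-title">First sample question</h2>
      </a>
    </div>
  </section>
  <section class="HotItem">
    <div class="HotItem-index"><div class="HotItem-rank">2</div></div>
    <div class="HotItem-content">
      <a href="https://www.zhihu.com/question/100002">
        <h2 class="HotItem-title">Second sample question</h2>
      </a>
    </div>
  </section>
</div>
"""


def parse_hot_items(text):
    """Return one dict per hot-list entry found in the given HTML text."""
    html = etree.HTML(text)
    items = []
    for section in html.xpath("//section[@class='HotItem']"):
        number = section.xpath(".//div[@class='HotItem-index']//text()")[0].strip()
        title = section.xpath(".//h2[@class='HotItem-title']/text()")[0].strip()
        href = section.xpath(".//div[@class='HotItem-content']/a/@href")[0].strip()
        # Assumption: the question ID is the last path segment of the link.
        question_id = href.rstrip('/').split('/')[-1]
        items.append({
            'number': number,
            'title': title,
            'href': href,
            'question_id': question_id,
        })
    return items


if __name__ == '__main__':
    for item in parse_hot_items(SAMPLE_HTML):
        print(item['number'], item['question_id'], item['title'])
```

Testing the xpath expressions on a fixed string like this is handy because it needs no cookie and no network access, so a broken selector can be told apart from a blocked request.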
Code implementation
First, fetch each question's title and the URL of the answers under it. The code is as follows:
import requests
from lxml import etree
import time
import multiprocessing
import pymysql
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
    'cookie': '***********'
}
url = 'https://www.zhihu.com/hot'
def get_question_num(url, headers):
    # Fetch the hot-list page and parse it into an element tree
    response = requests.get(url, headers=headers)
    text = response.text
    html = etree.HTML(text)
    # Each hot-list entry is a <section class="HotItem"> element
    reslut = html.xpath("//section[@class='HotItem']")
    # Collect the ID of each question
    question_list = []
    for question in reslut:
        number = question.xpath(".//div[@class='HotItem-index']//text()")[0].strip()
        title = question.xpath(".//h2[@class='HotItem-title']/text()")[0].strip()
        href = question.xpath(".//div[@class='HotItem-content']/a/@href")[0].strip()
        question_num = href.split('/')