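This was homework for a class; I'm backing it up here in case I need it later.
Every Zhihu question page lives at https://www.zhihu.com/question/ followed by a mysterious 8-digit number, so we can simply walk through the IDs in order and skip any that return 404. Scrapy makes this quick to build.
Example: https://www.zhihu.com/question/22913650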
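The enumeration idea can be sketched in plain Python, independent of Scrapy. The helper name `question_urls` and the sample ID range are assumptions made for illustration; the URL prefix is the one given above.

```python
from typing import Iterator

URL_BASE = "https://www.zhihu.com/question/"


def question_urls(start: int, stop: int) -> Iterator[str]:
    """Yield candidate question URLs for IDs in the half-open range [start, stop)."""
    for question_id in range(start, stop):
        yield URL_BASE + str(question_id)


if __name__ == "__main__":
    # Print three consecutive candidate URLs starting at the example ID.
    for url in question_urls(22913650, 22913653):
        print(url)
```

Because it is a generator, the full ID range never has to be held in memory.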
Code to get the question title:
title = response.selector.xpath("/html/head/title[1]/text()").extract_first()[0:-5]
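The `[0:-5]` slice drops the last five characters of the page title, and it raises a `TypeError` if the page has no `<title>` at all. A more defensive version could look like the sketch below. The helper name `clean_title` is hypothetical, and the suffix `" - 知乎"` is an assumption inferred from the five-character slice.

```python
from typing import Optional

# Assumed site suffix; it is exactly 5 characters long, matching the [0:-5] slice.
TITLE_SUFFIX = " - 知乎"


def clean_title(raw_title: Optional[str]) -> Optional[str]:
    """Strip the assumed site suffix; pass None and unsuffixed titles through unchanged."""
    if raw_title is None:
        return None
    if raw_title.endswith(TITLE_SUFFIX):
        return raw_title[: -len(TITLE_SUFFIX)]
    return raw_title
```

In the spider this would wrap the selector call, for example `clean_title(response.xpath("/html/head/title[1]/text()").extract_first())`.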
Code to extract the question's tags:
head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first().split()
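The tag extraction ends with a plain `split()`, which splits on whitespace only and fails if the meta tag is missing. A slightly more defensive helper might look like this. The name `split_tags` and the choice to also accept commas as separators are assumptions, since the exact format of the `content` attribute is not shown here.

```python
import re
from typing import List, Optional


def split_tags(content: Optional[str]) -> List[str]:
    """Split a meta content string into tags; return [] when the value is missing or blank."""
    if not content:
        return []
    # Split on runs of ASCII commas, full-width commas, or whitespace.
    parts = re.split(r"[,,\s]+", content)
    return [part for part in parts if part]
```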
Code to get the upvote count of the first answer:
praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
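The selector returns the upvote count as a string, or `None` when the question has no answers, so it needs converting before any numeric comparison. A minimal sketch, assuming hypothetical helper names `parse_upvotes` and `is_popular`; the threshold of 100 is taken from the commented-out check in the spider.

```python
from typing import Optional


def parse_upvotes(raw: Optional[str]) -> Optional[int]:
    """Convert the meta tag's content string to an int, or None if missing or invalid."""
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        return None


def is_popular(raw: Optional[str], threshold: int = 100) -> bool:
    """Return True only when the count parses and is strictly above the threshold."""
    count = parse_upvotes(raw)
    return count is not None and count > threshold
```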
The Scrapy spider code:
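# -*- coding: utf-8 -*-
import scrapy


class ZhihuqSpider(scrapy.Spider):
    name = 'zhihuq'
    allowed_domains = ["www.zhihu.com"]
    # Not used for scheduling, because start_requests() is overridden below
    start_urls = ['https://www.zhihu.com/question/22913650']

    def parse(self, response):
        # Extract the title; guard against a missing <title> before slicing off the 5-character suffix
        raw_title = response.selector.xpath("/html/head/title[1]/text()").extract_first()
        title = raw_title[0:-5] if raw_title else None
        # Skip pages without a title and the "安全验证" (security verification) page
        if title and (title != "安全验证"):
            # Extract the tags; default to "" so split() does not fail when the meta tag is missing
            head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first(default="").split()
            # Get the upvote count of the first answer (None if the selector matches nothing)
            praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
            # if int(praise_num) > 100 :
            yield {
                'title': title,
                'head_list': head_list,
                'praise_num': praise_num
            }

    def start_requests(self):
        url_base = r"https://www.zhihu.com/question/"
        start_index = 0
        # Read the position where the previous run stopped; start from 0 on the first run
        try:
            with open("count.txt", "r") as f:
                start_index = int(f.read())
        except FileNotFoundError:
            pass
        for i in range(start_index, 99999999):
            url = url_base + str(i)
            # Record the current position on every iteration. This is a real flaw, since it keeps
            # opening and closing the file, but it happens to double as a small delay.
            with open("count.txt", "w") as f:
                f.write(str(i))
            yield scrapy.Request(url, callback=self.parse)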
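The progress file is rewritten for every single ID, which the comment in the spider already flags as a problem. One way to reduce the churn is to save the position only every N IDs. The sketch below is plain Python with hypothetical names (`load_checkpoint`, `save_checkpoint`, `ids_with_checkpoint`) and an assumed interval of 1000; it is not part of the original spider.

```python
import os
import tempfile
from typing import Iterator


def load_checkpoint(path: str, default: int = 0) -> int:
    """Return the saved position, or `default` if the file is missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return default


def save_checkpoint(path: str, value: int) -> None:
    """Overwrite the checkpoint file with the given position."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(value))


def ids_with_checkpoint(path: str, stop: int, every: int = 1000) -> Iterator[int]:
    """Yield IDs from the saved position up to `stop`, saving whenever the ID is a multiple of `every`."""
    start = load_checkpoint(path)
    for i in range(start, stop):
        if i % every == 0:
            save_checkpoint(path, i)
        yield i


if __name__ == "__main__":
    # Small demonstration against a throwaway file in a temporary directory.
    demo_path = os.path.join(tempfile.mkdtemp(), "count.txt")
    print(list(ids_with_checkpoint(demo_path, 5, every=2)))
    print(load_checkpoint(demo_path))
```

After a restart, up to `every - 1` IDs are yielded again, so the crawl repeats a little work. As in the original, the file records the ID being scheduled, not the last page actually downloaded, and writing less often also removes the accidental delay, so `DOWNLOAD_DELAY` has to carry that job alone.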
Changes to settings.py:
Request headers:
DEFAULT_REQUEST_HEADERS = {
    "Host": "www.zhihu.com",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Referer": "http://www.zhihu.com/people/raymond-wang",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-TW;q=0.2",
}
Ignore robots.txt and disable cookies:
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
Set a download delay:
DOWNLOAD_DELAY = 0.1
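It is worth estimating how long a full scan would take at this delay. The rough calculation below assumes requests to the site are spaced one delay apart and ignores network latency, retries, and blocking, so it is only a lower bound. The function name `estimated_days` is made up for this example.

```python
SECONDS_PER_DAY = 86400


def estimated_days(num_ids: int, delay_seconds: float) -> float:
    """Lower bound on crawl duration in days when requests are spaced delay_seconds apart."""
    return num_ids * delay_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 99,999,999 IDs at 0.1 s each is roughly 115.7 days.
    print(round(estimated_days(99_999_999, 0.1), 1))
```

So even at ten requests per second, walking the whole ID range takes months, which is why saving the current position to count.txt matters.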