A bare-bones Scrapy spider for crawling Zhihu questions and tags

A class assignment, backed up here in case it comes in handy later.

Every Zhihu question page lives at https://www.zhihu.com/question/ followed by an 8-digit numeric ID, so we can simply enumerate the IDs in order and skip anything that returns a 404. The Scrapy framework makes this quick to put together.

Example: https://www.zhihu.com/question/22913650
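By default, Scrapy's HttpError spider middleware drops non-2xx responses before they ever reach the parse callback, so 404s are skipped with no extra code. If you also want to log which IDs were skipped, here is a minimal sketch of an errback (log_404 is a hypothetical name; it would be attached with errback=self.log_404 on each Request):

from scrapy.spidermiddlewares.httperror import HttpError

def log_404(self, failure):
    # Called for failed requests; non-2xx responses arrive wrapped in HttpError
    if failure.check(HttpError):
        self.logger.info("skipped %s", failure.value.response.url)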

Code to extract the question title:

title = response.selector.xpath("/html/head/title[1]/text()").extract_first()[0:-5]
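The [0:-5] slice removes the five-character " - 知乎" suffix that Zhihu appends to every page title. A slightly safer sketch guards against a missing title before slicing, since extract_first() returns None when nothing matches:

title_raw = response.xpath("/html/head/title[1]/text()").extract_first()
# Only strip the suffix if a title was actually found
title = title_raw[0:-5] if title_raw else None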

Code to extract the question's tags:

head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first().split()

Code to get the upvote count of the first answer:

praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
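All three selectors can be tested interactively before going into the spider; whether they return anything depends on Zhihu serving the real page rather than its "安全验证" (security check) page:

$ scrapy shell "https://www.zhihu.com/question/22913650"
>>> response.xpath("/html/head/title[1]/text()").extract_first()
>>> response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first()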

The Scrapy spider code:

# -*- coding: utf-8 -*-
import scrapy


class ZhihuqSpider(scrapy.Spider):

    name = 'zhihuq'
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/question/22913650']

    def parse(self, response):
        # Extract the title; extract_first() returns None on pages without one
        title = response.selector.xpath("/html/head/title[1]/text()").extract_first()
        if title:
            title = title[0:-5]  # strip the trailing " - 知乎" suffix
        if title and title != "安全验证":  # skip Zhihu's security-check pages
            # Extract the tags: space-separated keywords in a <meta> tag
            head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first()
            head_list = head_list.split() if head_list else []
            # Get the upvote count of the first answer
            praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
            # if int(praise_num) > 100:
            yield {
                'title': title,
                'head_list': head_list,
                'praise_num': praise_num
            }

    def start_requests(self):
        url_base = "https://www.zhihu.com/question/"
        start_index = 0
        # Resume from wherever the previous run stopped
        try:
            with open("count.txt", "r") as f:
                start_index = int(f.read())
        except (IOError, ValueError):
            pass  # first run: count.txt does not exist yet
        for i in range(start_index, 99999999):
            url = url_base + str(i)
            # Persist the current position on every iteration. Constantly
            # reopening the file is wasteful, but it doubles as a crude delay.
            with open("count.txt", "w") as f:
                f.write(str(i))
            yield scrapy.Request(url, callback=self.parse)
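To run the spider and save the yielded items, Scrapy's feed export can write them straight to a file (questions.jl is just an example name; the .jl extension selects JSON Lines output):

$ scrapy crawl zhihuq -o questions.jl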

Settings to change in settings.py:

Request headers:

DEFAULT_REQUEST_HEADERS = {
    "Host": "www.zhihu.com",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Referer": "http://www.zhihu.com/people/raymond-wang",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-TW;q=0.2",
}

Ignore robots.txt and disable cookies (ROBOTSTXT_OBEY controls robots.txt compliance; COOKIES_ENABLED = False turns off cookie handling):

ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False

Set a download delay:

DOWNLOAD_DELAY = 0.1
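A fixed 0.1-second delay is quite aggressive for a crawl this long. As an alternative, Scrapy's built-in AutoThrottle extension adjusts the delay automatically based on server response times; a minimal sketch:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10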