Scrapy Spiders (1): XPath

1. Create a Scrapy project


$ scrapy startproject ArticleSpider  # ArticleSpider is the project name
  • Terminal output:
    New Scrapy project 'ArticleSpider', using template directory '/Users/chao/.pyenv/versions/2.7.13/lib/python2.7/site-packages/scrapy/templates/project', created in: /Users/.../ArticleSpider
    You can start your first spider with: 
    cd ArticleSpider 
    scrapy genspider example example.com 
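  • For reference, the generated project layout looks roughly like this (the exact file list varies slightly across Scrapy versions):
ArticleSpider/
    scrapy.cfg            # deploy configuration
    ArticleSpider/        # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules live here
            __init__.py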

2. Generate a spider from the default template


$ cd ArticleSpider
$ scrapy genspider jobbole blog.jobbole.com
  • This creates a jobbole.py file under the spiders directory; the terminal shows:
    Created spider 'jobbole' using template 'basic' in module: ArticleSpider.spiders.jobbole 
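  • The generated jobbole.py follows the 'basic' template and looks roughly like this (a sketch; the exact boilerplate depends on the Scrapy version):
# -*- coding: utf-8 -*-
import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass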

3. Modify settings.py


# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # if True, URLs disallowed by robots.txt are filtered out, so some pages may fail to be crawled
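  • Other settings in the same file can be tuned as needed; for example, DOWNLOAD_DELAY throttles request frequency (the value below is only illustrative):
DOWNLOAD_DELAY = 1  # wait 1 second between consecutive requests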

4. Create main.py (for breakpoint debugging in PyCharm)


  • Create it in the ArticleSpider directory with the following content:
# coding:utf-8

from scrapy.cmdline import execute
import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
#print(os.path.abspath(__file__))  # path of this file
#print(os.path.dirname(os.path.abspath(__file__)))  # path of the parent directory

execute(["scrapy", "crawl", "jobbole"])
  • Running main.py then takes the place of the spider launch command:
$ scrapy crawl jobbole
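  • Additional command-line arguments can be appended the same way; for example, to also dump the scraped items to a file (the output filename here is illustrative):
execute(["scrapy", "crawl", "jobbole", "-o", "articles.json"])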

5. Debug XPath expressions


$ scrapy shell http://blog.jobbole.com/112558/ 
  • This drops into a Python shell debugging environment, so you don't waste time re-downloading and parsing the URL every time main.py runs.
>>> title = response.xpath("//*[@id='post-112558']/div[1]/h1/text()")
>>> title.extract()
# extract() returns a list: [u'\u90fd100%\u4ee3\u7801\u8986\u76d6\u4e86\uff0c\u8fd8\u4f1a\u6709\u4ec0\u4e48\u95ee\u9898\uff1f']
>>> title.extract()[0]
# take the element out of the list: u'\u90fd100%\u4ee3\u7801\u8986\u76d6\u4e86\uff0c\u8fd8\u4f1a\u6709\u4ec0\u4e48\u95ee\u9898\uff1f'

>>> print(title.extract()[0])  # prints the Chinese text instead of the \u escapes
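  • extract_first() (used in the next section) is a safer alternative to extract()[0]: when nothing matches, it returns its argument as the default instead of raising IndexError. A quick sketch with a deliberately non-matching selector:
>>> response.xpath("//span[@class='no-such-class']/text()").extract_first("")
''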

6. Write the debugged code into jobbole.py


# -*- coding: utf-8 -*-
import scrapy
import re

import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 hack to avoid UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 0

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/112558/']

    def parse(self, response):
        # extract the article's fields
        # xpath:
        title = response.xpath("//*[@id='post-112558']/div[1]/h1/text()").extract()[0]  # 1. XPath copied from the browser inspector
        create_data = response.xpath('//*[@id="post-112558"]/div[2]/p/text()').extract()[0].strip().replace('·', '')  # publish date: strip whitespace and drop the '·' separator
        praise_nums = int(response.xpath('//span[contains(@class, "href-style vote-post-up")]/h10/text()').extract()[0])  # 2. match the span by a class keyword with contains()
        favor_nums = response.xpath('//span[contains(@class, "bookmark-btn")]/text()').extract_first("")  # extract_first() equals extract()[0] but avoids the empty-list IndexError; its argument is the default return value
        match_re = re.match(".*?(\d+).*", favor_nums)  # regex to pull the digits out of the button text
        if match_re:
            favor_nums = int(match_re.group(1))
        else:
            favor_nums = 0
        comment_nums = response.xpath('//*[@id="post-112558"]/div[3]/div[4]/a/span/text()').extract()[0]
        match_re2 = re.match(".*?(\d+).*", comment_nums)
        if match_re2:
            comment_nums = int(match_re2.group(1))
        else:
            comment_nums = 0
        content = response.xpath('//div[@class="entry"]').extract()[0]  # 3. the '//tag[@class="..."]' form
        tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]  # filter out the "评论" (comment-count) entry that duplicates comment_nums
        tags = ",".join(tag_list)