Scraping the Sports and Society sections of People's Daily Online (人民网) with Scrapy

fiery_heart

Category: Crawlers. Tags: scrapy, requests

Article link: https://blog.csdn.net/fiery_heart/article/details/82348575

Analysing the site

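When I first looked through these two sections of the site, it seemed like a lot of work: many parse functions, many rules, and every subsection under both sections handled one by one. I started writing in that direction, and halfway through I found that I not only had to check whether a subsection contained images, but also whether it was a photo gallery. That suggested I had taken the wrong approach, so I went back and re-examined the site together with the parsing code I had already written, and concluded:
The data I need is just the article title, the article body, and the images in the article. Only the page structure varies, and pages from before 2015 use an older layout that the existing parsing rules cannot extract data from. So I first merged all the parsing rules into one and ran the crawl. I saved the URL of every page that yielded no title and analysed why there was no data. The conclusion: either the page is not a target page at all, or the parsing rules are not comprehensive enough. Pages that are not targets are simply ignored. For pages the existing rules cannot handle, I defined new rules and merged them in. After several rounds of this, I ended up with a reasonably complete rule set.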
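The "save the URLs that produced no title, then work out why" step can be made less tedious by grouping those URLs by path prefix, so that a whole family of pages (for example an old layout) shows up as one bucket. The following is only a sketch under assumptions: the function name `group_missing_by_path`, the `depth` parameter and the sample URLs in the usage note are made up for illustration, and it relies on the `kong_title` placeholder that the spider's `check()` method writes when no title is matched.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Placeholder written by the spider's check() when the title XPath matches nothing.
MISSING_TITLE = "kong_title"


def group_missing_by_path(items, depth=2):
    """Group the URLs of title-less items by their first `depth` path segments.

    `items` is any iterable of dict-like objects with "title" and "url" keys
    (hypothetical input: for example items loaded back from a JSON export).
    """
    groups = defaultdict(list)
    for item in items:
        if item.get("title") != MISSING_TITLE:
            continue  # the page parsed fine, nothing to investigate
        segments = [s for s in urlparse(item["url"]).path.split("/") if s]
        groups["/".join(segments[:depth])].append(item["url"])
    return dict(groups)
```

Calling `group_missing_by_path(items)` on the exported items returns a dictionary keyed by prefixes such as `n1/2018`. A bucket full of non-article pages can be ignored, while a bucket of real articles points at a layout that still needs a new rule.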

Writing the code

The spider file is the main part. Things like the downloader middleware and the settings file are not listed here.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import RenminwangItem
# The spider name, allowed domains and start URLs are defined in the settings file
# because the Society and Sports sections are nearly identical and can share the
# same parsing rules. Keeping these values in settings makes switching easy.
from ..settings import SPIDER_NAME, ALLOWED_DOMAINS, START_URLS


class TiyuSpider(CrawlSpider):
    name = SPIDER_NAME
    allowed_domains = ALLOWED_DOMAINS
    start_urls = START_URLS

    rules = (
        # Every article under the section is wanted, so no link-extraction pattern
        # is needed: every link is followed.
        # Links outside the allowed domains (i.e. not in the Sports section) are filtered out.
        Rule(LinkExtractor(allow=()), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        item = RenminwangItem()
        # Title
        title = response.xpath("//h1/div/text() | //h1/text()").extract()
        self.check(item, title, 'title')
        # Body text of a regular article (duplicate alternatives from the original removed)
        content_ls = response.xpath(
            "//p[@style='text-indent: 2em;']/text()"
            " | //p[@style='text-indent: 2em']/text()"
            " | //div[@class='box_con']//p/text()"
            " | //div[@id='p_content']/p[not(@style)]/text()"
            # Note: this last alternative selects the <div> element itself,
            # so extract() returns its markup rather than plain text.
            " | //div[@id='p_content'][./text() and ./a//text()]"
        ).extract()
        content = ''.join(content_ls)
        # Links to the images embedded in the article
        inset_ls = response.xpath(
            "//div[@class='text w1000 clearfix']//div[@class='box_pic']//img/@src"
            " | //div[@class='box_pic']/following-sibling::p/img/@src"
            " | //*[@align='center']/img/@src"
        ).extract()
        inset = '+++'.join(inset_ls)

        # self.check(item,content_ls,'content')
        # self.check(item,inset_ls,'inset')

        item['content'] = content
        item['inset'] = inset
        item['url'] = str(response.url)

        return item

    # Checks whether a field (in practice the title) came back empty. If it did,
    # a placeholder is stored so the page can be pulled out later to see why.
    def check(self, item, values, name):
        if len(values) == 0:
            item[name] = 'kong_{}'.format(name)  # "kong" is pinyin for "empty"
        else:
            item[name] = ''.join(values)
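The spider imports `SPIDER_NAME`, `ALLOWED_DOMAINS` and `START_URLS` from the settings file, which the post does not list. A minimal sketch of what those entries might look like follows. The `SECTIONS` dictionary, the `ACTIVE_SECTION` switch, the `shehui` name and the concrete hostnames are all assumptions made for illustration, not the author's actual configuration.

```python
# settings.py (excerpt) -- hypothetical sketch, not the original file.

# One profile per site section. Names and hostnames are assumptions.
SECTIONS = {
    "tiyu": {  # Sports
        "allowed_domains": ["sports.people.com.cn"],
        "start_urls": ["http://sports.people.com.cn/"],
    },
    "shehui": {  # Society
        "allowed_domains": ["society.people.com.cn"],
        "start_urls": ["http://society.people.com.cn/"],
    },
}

# Change this one value to point the same spider at the other section.
ACTIVE_SECTION = "tiyu"

# The three names the spider imports.
SPIDER_NAME = ACTIVE_SECTION
ALLOWED_DOMAINS = SECTIONS[ACTIVE_SECTION]["allowed_domains"]
START_URLS = SECTIONS[ACTIVE_SECTION]["start_urls"]
```

With this layout, switching from Sports to Society is a one-line change to `ACTIVE_SECTION`, and the spider class itself stays untouched.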

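Because allowed domains are set and some of the embedded images are hosted outside them, the images are downloaded in the pipeline file with requests.
The downloader could be made multithreaded or asynchronous to speed things up a little.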

import os

import requests

# Assumption: BASE_DIR (root folder for saved images) and BASE_URL (site prefix for
# relative image links) are not shown in the original post. They are taken here to
# live in the settings file alongside the other constants.
from .settings import BASE_DIR, BASE_URL


class DownLoadInsetPipeline(object):
    def process_item(self, item, spider):
        inset_list = item['inset'].split('+++')
        if inset_list[0] != '':
            for inset_url in inset_list:
                if inset_url.startswith(('http://', 'https://')):
                    # The image link is already absolute
                    dir_name = (item['url'].split('/')[-1]).replace('.html', '')
                    dir_path = os.path.join(BASE_DIR, dir_name)
                    if not os.path.exists(dir_path):
                        os.makedirs(dir_path)
                    self.down_img(dir_path, inset_url)
                else:
                    # Relative link: prefix it with the site base URL
                    img_url = BASE_URL + inset_url
                    html_name = (item['url'].split('/')[-1]).split('-')
                    if len(html_name) == 2:
                        dir_name = ''.join(html_name).replace('.html', '')
                    else:
                        # Drop the last part of the file name (it also carries '.html')
                        dir_name = ''.join(html_name[:-1])
                    dir_path = os.path.join(BASE_DIR, dir_name)
                    if not os.path.exists(dir_path):
                        os.makedirs(dir_path)
                    self.down_img(dir_path, img_url)
        return item

    def down_img(self, dir_path, img_url):
        # os.path.join keeps the path portable instead of hard-coding '\\'
        file_name = os.path.join(dir_path, img_url.split('/')[-1])
        resp = requests.get(img_url, timeout=30)
        resp.raise_for_status()
        # The image is written as raw bytes, so no text encoding needs to be set
        with open(file_name, 'wb') as f:
            f.write(resp.content)
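The remark that the download could be made multithreaded can be sketched with the standard library's `concurrent.futures`. This is an illustration under assumptions rather than the author's code: `download_all`, its `jobs` list of `(url, dir_path)` pairs and the injected `fetch` callable are hypothetical names, and `fetch` is passed in so the helper does not depend on requests itself.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(jobs, fetch, max_workers=8):
    """Run fetch(url, dir_path) for every (url, dir_path) job on a thread pool.

    Returns (done, failed): `done` lists the URLs that succeeded (in completion
    order), `failed` maps each failing URL to the exception it raised.
    """
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which URL each future belongs to so errors can be reported.
        futures = {pool.submit(fetch, url, path): url for url, path in jobs}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()  # re-raises whatever fetch raised
                done.append(url)
            except Exception as exc:
                failed[url] = exc
    return done, failed
```

Inside `process_item`, the loop would then only collect `(img_url, dir_path)` pairs and make a single call such as `download_all(jobs, lambda url, path: self.down_img(path, url))`. One failed image no longer aborts the rest, because each exception is captured per URL instead of propagating out of the pipeline.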

Summary

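Before crawling any site, analyse it carefully against your actual requirements and think it through. Once the overall logic is clear, the code follows naturally.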
