#coding=utf-8
import scrapy
import time
import re
from qqcrawler.items import QqcrawlerItem
class QzoneSpider(scrapy.Spider):
    """Crawl qzone.qq.com, yielding one item per page and following on-site links.

    For every response, emits a ``QqcrawlerItem`` with the crawl timestamp,
    the page URL and the page title, then schedules a new request for each
    absolute http(s) link on the page that points back into qzone.qq.
    """

    name = "qzone"
    # allowed_domains = ["qzone.qq.com/"]
    start_urls = [
        # "http://www.ncst.edu.cn/"
        "http://qzone.qq.com/"
        # ,"http://www.qq.com/"
    ]

    # Compiled once at class-definition time instead of re-parsing the
    # pattern for every href; raw string avoids the invalid-escape warning.
    _QZONE_LINK_RE = re.compile(r'^http.*qzone\.qq.*')

    def parse(self, response):
        """Scrapy callback: yield the scraped item, then follow qzone links.

        Fixes vs. the original:
        - the bare ``except:`` that silently swallowed every error
          (masking real bugs and even generator-close signals) is removed;
          Scrapy already logs exceptions raised in callbacks.
        - Python-2 ``print`` debug output replaced with the spider logger.
        - the title XPath is evaluated once instead of twice.
        """
        qq_item = QqcrawlerItem()  # scraped data for this page
        qq_item['c_time'] = time.time()
        qq_item['url'] = response.url
        # ``extract()`` returns a list of matches; ``or None`` preserves the
        # original behavior of storing None when no <title> is present.
        qq_item['title'] = response.xpath('/html/head/title').extract() or None
        yield qq_item
        for href in response.xpath('//@href').extract():
            if self._QZONE_LINK_RE.match(href):
                self.logger.debug('following link: %s', href)
                # Feed the link back into the crawl frontier.
                yield scrapy.Request(href, callback=self.parse)
# Scrapy crawler demo.
# (Blog-page footer text that was accidentally pasted here has been turned
# into this comment — it was prose, not valid Python.)