Scrapy crawler: complete code

Code for spiders/seventeen.py:

import scrapy
from ..items import SeventeenNewsItem
import time


class SeventeenSpider(scrapy.Spider):
    name = 'seventeen'
    allowed_domains = ['17173.com']
    start_urls = ['http://news.17173.com/']

    def start_requests(self):
        # The first list page is index.shtml; later pages are index_1.shtml, index_2.shtml, ...
        for i in range(0, 150):
            if i == 0:
                yield scrapy.Request(url=self.start_urls[0] + "index.shtml")
            else:
                yield scrapy.Request(url=self.start_urls[0] + "index_" + str(i) + ".shtml")

    def parse(self, response):
        for each in response.xpath("//li[@class='item']"):
            item = SeventeenNewsItem()

            seventeen_id = each.xpath("@data-key").extract_first()
            post_title = each.xpath("div[@class='item-con']/div[@class='text']/div[@class='tit']/a/text()").extract_first()
            post_cover_image = each.xpath("div[@class='item-con']/div[@class='pic']/a/img/@style").extract_first()
            post_target_url = each.xpath("div[@class='item-con']/div[@class='text']/div[@class='tit']/a/@href").extract_first()
            update_time_s = '2020-09-18 10:00:00'
            create_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

            if post_title:
                # data-key has the form "<numeric id>_<suffix>"; keep the numeric id
                item['seventeen_id'] = int(seventeen_id.split('_')[0])
                item['post_title'] = post_title
                item['post_cover_image'] = post_cover_image
                item['post_target_url'] = post_target_url
                item['update_time_s'] = update_time_s
                item['create_time'] = create_time

                # Follow the detail page to pick up create_time_s; the completed
                # item is yielded from parse_detail, so it is not collected here
                yield scrapy.Request(item["post_target_url"], callback=self.parse_detail, dont_filter=True, meta={"item": item})

    def parse_detail(self, response):
        item = response.meta["item"]
        item['create_time_s'] = response.xpath("//div[@class='gb-final-mod-info']/span[@class='gb-final-date']/text()").extract_first()
        yield item
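
Once the project files are in place, the crawl is started from the project root with the usual Scrapy command:

scrapy crawl seventeen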

Code for pipelines.py:

import pymysql

class SeventeenNewsPipeline:

    def __init__(self):
        # Connect to MySQL once when the pipeline is created;
        # charset='utf8mb4' is added here so Chinese titles are stored intact
        self.connect = pymysql.connect(host='localhost', user='root', password='root', db='test', port=3306, charset='utf8mb4')
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Deduplicate: skip items whose seventeen_id is already in the table
        self.cursor.execute('select * from lb_posts_seventeen where seventeen_id=%s', (item['seventeen_id'],))
        if not self.cursor.fetchone():
            # Insert with a parameterized query rather than string formatting,
            # which breaks on quotes in titles and invites SQL injection
            self.cursor.execute(
                'insert into lb_posts_seventeen(seventeen_id, post_title, post_cover_image, post_target_url, '
                'create_time_s, update_time_s, create_time) values (%s, %s, %s, %s, %s, %s, %s)',
                (item['seventeen_id'], item['post_title'], item['post_cover_image'], item['post_target_url'],
                 item['create_time_s'], item['update_time_s'], item['create_time']))
        self.connect.commit()
        return item

    # Close the database connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
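
The pipeline assumes a lb_posts_seventeen table already exists in the test database. A minimal sketch of a matching table follows; the column types and index are assumptions inferred from the scraped values, not taken from the original project:

CREATE TABLE lb_posts_seventeen (
    seventeen_id BIGINT NOT NULL,        -- numeric prefix of the list item's data-key (assumed width)
    post_title VARCHAR(255),
    post_cover_image VARCHAR(512),       -- raw style attribute carrying the cover image
    post_target_url VARCHAR(512),
    create_time_s VARCHAR(64),           -- publish-date text scraped from the detail page
    update_time_s DATETIME,
    create_time DATETIME,                -- time the row was scraped
    KEY idx_seventeen_id (seventeen_id)  -- speeds up the duplicate check
) DEFAULT CHARSET=utf8mb4;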

Code for items.py:

import scrapy


class SeventeenNewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    seventeen_id = scrapy.Field()
    post_title = scrapy.Field()
    post_cover_image = scrapy.Field()
    post_target_url = scrapy.Field()
    create_time_s = scrapy.Field()
    update_time_s = scrapy.Field()
    create_time = scrapy.Field()

The relevant parts of settings.py:

BOT_NAME = 'seventeen_news'

SPIDER_MODULES = ['seventeen_news.spiders']
NEWSPIDER_MODULE = 'seventeen_news.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   'seventeen_news.pipelines.SeventeenNewsPipeline': 300,
}
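
start_requests fires 150 list-page requests in quick succession, and ROBOTSTXT_OBEY is off. If the site starts rejecting requests, a throttle and a browser User-Agent can be added here; DOWNLOAD_DELAY and USER_AGENT are standard Scrapy settings, but the values below are illustrative, not from the original project:

DOWNLOAD_DELAY = 0.5
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'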

Source code download (free resource): https://download.csdn.net/download/huha666/12859743

 
