Scrapy Maoyan Spider



Preface

This post walks through a small Scrapy project that crawls the Maoyan Top 100 movie board (https://maoyan.com/board/4), extracts each movie's title, cast, and release date, and saves the results both to a CSV file and to a MySQL table.

I. Requirements

[Screenshot of the assignment requirements in the original post; the image is not reproduced here.]

II. Steps

1. Import libraries

The libraries the project actually uses (install the third-party ones with pip install scrapy pymysql; csv ships with Python):

import scrapy
import pymysql
import csv

# Disable HTTPS certificate verification globally, in case Maoyan's
# certificate causes SSL errors during the crawl.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
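
The original post does not include settings.py, but the two pipelines defined in step 4 only run if they are registered there. A minimal sketch, assuming the Scrapy project (and therefore the package) is named maoyan; the priority numbers and User-Agent string are illustrative:

# settings.py (sketch; the project name "maoyan" is an assumption)
BOT_NAME = 'maoyan'
ROBOTSTXT_OBEY = False

# A browser-like User-Agent helps avoid Maoyan's anti-bot verification page.
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}

# Lower numbers run first: the CSV pipeline writes before the MySQL one.
ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
    'maoyan.pipelines.MaoyanMysqlPipeline': 400,
}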

2. maoyanspider.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import MaoyanItem


class MaoyanspiderSpider(scrapy.Spider):
    name = 'maoyanspider'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/board/4']

    def parse(self, response):
        # Each movie on the board is a <dd> inside <dl class="board-wrapper">.
        dds = response.xpath("//dl[@class='board-wrapper']/dd")
        for dd in dds:
            item = MaoyanItem()
            info = dd.xpath("div[@class='board-item-main']/div[@class='board-item-content']/div[@class='movie-item-info']")
            item['name'] = info.xpath("p[@class='name']/a/text()").extract_first()
            actors = info.xpath("p[@class='star']/text()").extract_first()
            # Guard against a missing node before stripping whitespace.
            item['actors'] = actors.strip() if actors else None
            item['releasetime'] = info.xpath("p[@class='releasetime']/text()").extract_first()
            yield item
        # Follow the "下一页" (next page) link until the last page.
        next_page = response.xpath('//div[@class="pager-main"]/ul/li/a[contains(text(), "下一页")]/@href').extract_first()
        if next_page is not None:
            # response.urljoin resolves the relative href against the current URL,
            # replacing the bare "import urllib" (which does not expose urllib.parse).
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
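
With the project files in place, the crawl is started from the project root; the spider name is the name attribute defined above:

scrapy crawl maoyanspider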


3. items.py

import scrapy


class MaoyanItem(scrapy.Item):
    # One item per movie on the Top 100 board.
    name = scrapy.Field()         # movie title
    actors = scrapy.Field()       # cast line from the page
    releasetime = scrapy.Field()  # release date line

4. pipelines.py

import csv

import pymysql


class MaoyanPipeline(object):
    """Append every scraped item to maoyan.csv."""

    def process_item(self, item, spider):
        data_list = [item['name'], item['actors'], item['releasetime']]
        head = ('name', 'actors', 'releasetime')
        with open('maoyan.csv', 'a+', encoding='utf-8', newline='') as file:
            writer = csv.writer(file)
            # writer.writerow(head)  # uncomment on the first run to write the header row
            writer.writerow(data_list)
        return item


class MaoyanMysqlPipeline(object):
    """Insert every scraped item into the maoyan table of the test database."""

    def open_spider(self, spider):
        print('spider started')
        self.db = pymysql.connect(host='localhost', user='root',
                                  password='123456', database='test',
                                  port=3306, charset='utf8')
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        t = (item['name'], item['actors'], item['releasetime'])
        # Parameterized query; pymysql handles quoting and escaping.
        sql = 'insert into maoyan values (%s, %s, %s)'
        self.cursor.execute(sql, t)
        self.db.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print('spider finished')
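
The INSERT statement above assumes a maoyan table with three columns already exists in the test database; the original post never shows its schema. A minimal one-off setup script, with an assumed schema of three VARCHAR columns matching the item fields:

import pymysql

# One-off setup for the table the MySQL pipeline writes into.
# The column names and lengths are assumptions; the post only
# shows a three-value INSERT.
db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='test', port=3306, charset='utf8')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS maoyan (
        name VARCHAR(255),
        actors VARCHAR(255),
        releasetime VARCHAR(255)
    )
""")
db.commit()
cursor.close()
db.close()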

