Chapter 7 Homework

Enter the following in a command-prompt window:

pip install scrapy --user -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

cd <your folder path>

py -m scrapy startproject TipDMSpider
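After startproject, Scrapy generates a project skeleton roughly like the following (the exact files can vary slightly across Scrapy versions); the steps below edit items.py, pipelines.py, settings.py, and a spider module under spiders/:

```
TipDMSpider/
├── scrapy.cfg            # deployment configuration
└── TipDMSpider/          # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # project middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spider modules live here
        └── __init__.py
```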

cd TipDMSpider

Inside the inner TipDMSpider folder:

Edit the script items.py:

import scrapy
class TipdmspiderItem(scrapy.Item):
    # one Field per piece of data the spider collects
    title = scrapy.Field()
    text = scrapy.Field()

Edit the script pipelines.py:

import pandas as pd
from sqlalchemy import create_engine
class TipdmspiderPipeline(object):
    def __init__(self):  # note: double underscores, not _init_
        # connection string format: mysql+pymysql://user:password@host:port/database
        self.engine = create_engine('mysql+pymysql://root:335210@127.0.0.1:3306/tipdm')

    def process_item(self, item, spider):
        data = pd.DataFrame(dict(item))
        # append each scraped item to the MySQL table and to a local CSV file
        data.to_sql('tipdm_data', self.engine, if_exists='append', index=False)
        data.to_csv('TipDM_data.csv', mode='a+', index=False, sep='|', header=False)
        return item  # pipelines should return the item for any later pipelines
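A minimal sketch of the DataFrame step inside process_item, using a hand-made dict in place of a real scraped item (the field values are assumed examples). The key point is that each field must hold a list, which is exactly what scrapy's `.extract()` returns:

```python
import pandas as pd

# Stand-in for dict(item); each field maps to a list of strings,
# as returned by .extract() (the values here are assumed examples).
item = {'title': ['Example title'], 'text': ['Example body text']}

# dict-of-lists -> one-row DataFrame; passing bare scalar values instead
# would raise "If using all scalar values, you must pass an index".
data = pd.DataFrame(item)

# Same formatting the pipeline uses for the CSV file: '|' separator,
# no header row, no index column
csv_line = data.to_csv(index=False, sep='|', header=False)
print(csv_line.strip())
```

This is why the spider below stores its joined text as a one-element list rather than a plain string.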
        

py -m scrapy genspider tipdm www.tipdm.com

Add the following to settings.py:


ROBOTSTXT_OBEY = False   # ignore robots.txt for this exercise
DOWNLOAD_DELAY = 5       # wait 5 seconds between requests
ITEM_PIPELINES = {
    'TipDMSpider.pipelines.TipdmspiderPipeline': 300,
}
HTTPCACHE_ENABLED = True  # cache responses locally so re-runs are fast
HTTPCACHE_DIR = 'D:/class/class/爬虫/7/TipDMSpider'

Edit tipdm.py:

import scrapy
from scrapy.http import Request
from TipDMSpider.items import TipdmspiderItem

class TipdmSpider(scrapy.Spider):
    name = 'tipdm'
    allowed_domains = ['www.tipdm.com']
    start_urls = ['http://www.tipdm.com/']

    def parse(self, response):
        last_page_num=response.xpath("//div[@class='fpage']/div/a[last()]/text()").extract()
        append_urls=['http://www.tipdm.com/tipdm/tddt/index_%d.html'%i\
                     for i in range(2,int(last_page_num[0])+1)]
        append_urls.append('http://www.tipdm.com/tipdm/tddt')
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
        }
        for url in append_urls:
            yield Request(url,callback=self.parse_url,dont_filter=True,headers=headers)
    def parse_url(self,response):
        urls=response.xpath("//div[@class='item clearfix']/div[1]/h1/a/@href").extract()
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
        }
        for page_url in urls:
            text_url="http://www.tipdm.com"+page_url
            yield Request(text_url, callback=self.parse_text, dont_filter=True, headers=headers)
    def parse_text(self,response):
        item=TipdmspiderItem()
        item['title']=response.xpath("//div[@class='artTitle']/h1/text()").extract()
        text=response.xpath("//div[@class='artCon']//p/text()").extract()
        texts=""
        for strings in text:
            texts=texts+strings+"\n"
        # store the joined text; wrap it in a list so pd.DataFrame(dict(item))
        # in the pipeline receives list values for every field
        item['text']=[texts]

        yield item
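The spider above builds absolute links by string concatenation. A hedged stdlib alternative is `urllib.parse.urljoin` (Scrapy responses also offer `response.urljoin`), which handles both relative hrefs and hrefs that are already full URLs; the sample paths below are hypothetical:

```python
from urllib.parse import urljoin

base = 'http://www.tipdm.com/tipdm/tddt/'

# An absolute-path href, as extracted from a listing page, is resolved
# against the scheme and host of the base URL
print(urljoin(base, '/tipdm/tddt/123.html'))

# An href that is already a full URL is returned unchanged, so there is
# no risk of doubling the "http://www.tipdm.com" prefix
print(urljoin(base, 'http://www.tipdm.com/x.html'))
```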


py -m scrapy crawl tipdm

The results are as shown in the figure below.

Tip: if a required package is not installed, run

pip install <package-name> --user -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

With this mirror, installation finishes quickly.
