[Scrapy Crawler] Customize a Frequently Used Website Yourself: Remove the Ads, Save Time

Latest recommended article published on 2022-02-23 19:37:44

织网者Eric

Category: Crawlers. Tags: crawler, ads, American TV series

Article link: https://blog.csdn.net/juwikuang/article/details/72809243

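Because an American TV series site was cluttered with ads and had split its listing across several pages, the author wrote a Scrapy crawler that scrapes the site and generates ad-free pages, which made browsing much more pleasant. The post provides a demo download and lists its dependencies, and it was updated on July 31, 2017 to cope with changes in the site's code.

Summary generated automatically by CSDN.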

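Introduction

I used Scrapy to crawl an American TV series site. I did not originally want to crawl it, but the site has far too many ads, and it recently split one page into six. Every visit meant opening six pages and sitting through a pile of ads, my worn-out computer kept freezing, and it was driving me crazy. So I wrote my own crawler. Once the crawl finishes, it generates pages with no ads at all, and my mood improved immediately ^_^.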

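Before

Look: ads everywhere, and the resources are split by day into six pages.

So I rolled up my sleeves and customized the site myself. The image below shows the result.

After

As you can see, the customized page is much cleaner.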
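The idea described above, collecting the entries from the six pages and writing them out as one page without ads, can be sketched roughly as follows. This is only a minimal illustration using the standard library, not the author's actual code: the function names `merge_pages` and `build_clean_page` and the item fields `title` and `url` are assumptions made for the example.

```python
from html import escape


def merge_pages(pages):
    """Flatten per-page item lists into one list, dropping duplicate URLs."""
    seen = set()
    merged = []
    for page in pages:
        for item in page:
            # Keep only the first occurrence of each URL.
            if item["url"] not in seen:
                seen.add(item["url"])
                merged.append(item)
    return merged


def build_clean_page(items, page_title="Latest"):
    """Render items as one plain HTML page containing only a list of links."""
    rows = []
    for item in items:
        # Escape scraped text so it cannot inject markup into the output.
        title = escape(item["title"])
        url = escape(item["url"], quote=True)
        rows.append('<li><a href="{0}">{1}</a></li>'.format(url, title))
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        "<title>{0}</title></head>\n"
        "<body><ul>\n{1}\n</ul></body></html>\n"
    ).format(escape(page_title), "\n".join(rows))
```

In a real crawler the items would come from the spider's parse callback. Here they are plain dictionaries so the page-building step can be tried on its own.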

Demo

Demo download:
http://download.csdn.net/detail/juwikuang/9855793

Dependencies: Python, Scrapy
To run it, just double-click run.bat.
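The contents of run.bat are not shown in this post. As a rough sketch of what launching a standalone spider can look like, the snippet below builds and runs a `scrapy runspider` command from Python. The helper names `build_command` and `run_spider` and the file name `latest_spider.py` are hypothetical, and the actual demo may start the spider differently.

```python
import subprocess


def build_command(spider_file):
    """Return the command line that runs a single-file spider."""
    # `scrapy runspider <file>` runs a spider without a full Scrapy project.
    return ["scrapy", "runspider", spider_file]


def run_spider(spider_file):
    """Run the spider and return the exit code of the scrapy process."""
    return subprocess.call(build_command(spider_file))


# Example (hypothetical file name; requires Scrapy to be installed):
# run_spider("latest_spider.py")
```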

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Spider for TTMEIJU.COM

Previously, ttmeiju.com presented all the latest TV shows and movies
on one single page, which was very convenient for users.
However, since maybe last year, ttmeiju has split that single page into
six pages, which is very annoying to me.

I miss the good old days when there was only one page......

Do you? If you do, this script is for you.

Created on Sun May 28 12:09:05 2017

@author: Eric Chow
""" 
import scrapy
from scrapy import signals 

class LatestSpider(scrapy.Spider):
    name = "latest" 
    start_urls = [
        "http://www.ttmeiju.com/latest-0.ht