2021.10.6 — using Scrapy
Press Win + R, type cmd, and confirm to open the Command Prompt.
In the prompt, run:
scrapy startproject movie
scrapy genspider meiju meijutt.com
Open the generated movie folder.
movie/
├── scrapy.cfg
└── movie/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── meiju.py
# internal layout as above
scrapy.cfg: project configuration file
movie/: the project's Python module; code will be imported from here later
movie/items.py: the project's items file
movie/pipelines.py: the project's pipelines file
movie/settings.py: the project's settings file
movie/spiders/: directory that holds the spiders
Edit meiju.py:
import scrapy
# items.py sits one directory up, so import it through the project package
from movie.items import MovieItem

class MeijuSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['meijutt.com']
    start_urls = ['https://m.meijutt.tv/']

    def parse(self, response):
        # each <li> under the ranking list is one show
        movies = response.xpath('//ul[@class="top-list fn-clear"]/li')
        for each_movie in movies:
            item = MovieItem()
            item['name'] = each_movie.xpath('./h5/a/@title').extract()[0]
            yield item
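The XPath selection in parse() can be exercised offline. The sketch below uses only the standard library's ElementTree on a made-up HTML fragment that mimics the structure the spider expects; the real page needs Scrapy's more forgiving parser, so this is an illustration of the selection logic only:

```python
import xml.etree.ElementTree as ET

# Made-up fragment imitating the list the spider targets.
html = """
<div>
  <ul class="top-list fn-clear">
    <li><h5><a title="Show A" href="#">Show A</a></h5></li>
    <li><h5><a title="Show B" href="#">Show B</a></h5></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
names = []
# Same idea as //ul[@class="top-list fn-clear"]/li in the spider
for li in root.findall(".//ul[@class='top-list fn-clear']/li"):
    # ./h5/a/@title in the spider becomes .find(...).get("title") here
    names.append(li.find("./h5/a").get("title"))

print(names)  # ['Show A', 'Show B']
```

Each `li` yields one title, which is exactly what parse() packs into a MovieItem per iteration.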
Edit items.py:
import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
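A scrapy.Item works like a dict whose keys are restricted to the declared Fields, which is why `name = scrapy.Field()` must be declared before the spider can do `item['name'] = ...`. A rough stand-in in plain Python (a toy illustration, not Scrapy's actual implementation):

```python
class FakeItem:
    """Toy model of scrapy.Item: dict access limited to declared fields."""
    fields = {"name"}  # declared via scrapy.Field() in the real class

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        # Reject keys that were never declared, as scrapy.Item does
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

item = FakeItem()
item["name"] = "Show A"
print(item["name"])  # Show A
```

Assigning to an undeclared key (e.g. item["year"]) raises KeyError, which catches typos early.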
Edit settings.py
and add:
ITEM_PIPELINES = {'movie.pipelines.MoviePipeline': 100}
RETRY_TIMES = 8
DEPTH_LIMIT = 2
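The number in ITEM_PIPELINES (0–1000) is an ordering priority: pipelines with lower numbers run first. If the project later grew a second pipeline (CleanupPipeline is a hypothetical name for illustration), the setting could look like:

```python
# settings.py sketch; CleanupPipeline is hypothetical
ITEM_PIPELINES = {
    'movie.pipelines.CleanupPipeline': 100,  # lower number: runs first
    'movie.pipelines.MoviePipeline': 300,    # runs after cleanup
}
```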
Edit pipelines.py:
from itemadapter import ItemAdapter

class MoviePipeline(object):
    def process_item(self, item, spider):
        # append each show name as one line; utf-8 keeps Chinese titles intact
        with open('my_meiju.txt', 'a', encoding='utf-8') as fp:
            fp.write(item['name'] + '\n')
        return item
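The pipeline's file-append step can be checked in isolation. The sketch below feeds two hand-made items through the same open/append/write logic, using a temp-directory path so it doesn't touch the real my_meiju.txt:

```python
import os
import tempfile

# Hand-made items standing in for what the spider yields.
items = [{"name": "Show A"}, {"name": "Show B"}]

path = os.path.join(tempfile.gettempdir(), "my_meiju.txt")
open(path, "w").close()  # start fresh so reruns don't accumulate lines

for item in items:
    # Same append-one-name-per-line logic as MoviePipeline.process_item
    with open(path, "a", encoding="utf-8") as fp:
        fp.write(item["name"] + "\n")

with open(path, encoding="utf-8") as fp:
    lines = fp.read().splitlines()
print(lines)  # ['Show A', 'Show B']
```

Opening in append mode per item matches how process_item is called once per yielded item during a crawl.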
Return to the Command Prompt,
cd into the project root directory, and run:
scrapy crawl meiju
Original post: https://blog.csdn.net/skyie53101517/article/details/64147956