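This walkthrough crawls http://www.yy6080.cn/vodtypehtml/1.html (Google Chrome is recommended, since it makes the page source easy to inspect).
First, open a command line and create the Scrapy project: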
scrapy startproject NewVideoMovie
Then change into the project directory:
cd NewVideoMovie
Finally, create the spider:
scrapy genspider spider http://www.yy6080.cn/vodtypehtml/1.html
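A side note on the command above: genspider takes a spider name followed by a domain, while here it is given a full page URL. Some older Scrapy releases were known to copy that argument verbatim into the generated spider, so it is worth checking allowed_domains and start_urls in the generated file. A minimal sketch, using only the standard library, of how the two values relate to the URL (the variable names are illustrative, not taken from the original project):

```python
from urllib.parse import urlparse

# The page this tutorial crawls.
START_URL = "http://www.yy6080.cn/vodtypehtml/1.html"

# allowed_domains should hold bare host names, not full URLs.
allowed_domains = [urlparse(START_URL).netloc]

# start_urls keeps the complete URL, scheme included.
start_urls = [START_URL]

print(allowed_domains)
print(start_urls)
```

If the generated spider shows a scheme or path inside allowed_domains, editing the two lists by hand to match the values above should be enough.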
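The project after creation:
The two .py files under the dao folder handle the database connection.
That gives us the basic crawler skeleton, so let's move on to the next step:
Create the database db_newvideomovie_data along with two tables. (Only the table headers matter here; the rows shown are what gets stored after a successful crawl.)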
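The table definitions themselves are not reproduced in the text, so here is a rough sketch of what the two tables might look like. Everything in it is an assumption: the table names film_list and film_info are hypothetical, the columns simply mirror the item fields defined further down, and sqlite3 is used as a stand-in so the example runs without a database server (the post does not say which engine the project uses).

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions modelled on
# the item fields, not the original tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS film_list (
    id           INTEGER PRIMARY KEY,
    film_name    TEXT NOT NULL,
    film_ranking TEXT,
    film_type    TEXT,
    film_href    TEXT
);
CREATE TABLE IF NOT EXISTS film_info (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    director     TEXT,
    scriptwriter TEXT,
    protagonist  TEXT,
    type         TEXT,
    country      TEXT,
    language     TEXT,
    release_time TEXT,
    ranking      TEXT,
    content      TEXT
);
"""


def create_schema(conn):
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)
    conn.commit()


def list_tables(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


# An in-memory database keeps the sketch free of side effects.
conn = sqlite3.connect(":memory:")
create_schema(conn)
print(list_tables(conn))
```

One table holds the list-page fields and the other the detail-page fields, which matches the split between the two groups of fields in the item class.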
items.py code:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class NewvideomovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    filmName = scrapy.Field()     # film title
    filmRanking = scrapy.Field()  # film rating
    filmType = scrapy.Field()     # film genre
    filmHref = scrapy.Field()     # link to the film's page
    nextURL = scrapy.Field()      # link to the next page
    nextPage = scrapy.Field()

    # Detail (second-level) page
    filminfo_name = filmName
    filminfo_director = scrapy.Field()      # director
    filminfo_scriptwriter = scrapy.Field()  # screenwriter
    filminfo_protagonist = scrapy.Field()   # lead actors
    filminfo_type = scrapy.Field()          # genre
    filminfo_country = scrapy.Field()       # country of production
    filminfo_language = scrapy.Field()      # language
    filminfo_releasetime = scrapy.Field()   # release date
    filminfo_ranking = scrapy.Field()       # rating
    filminfo_content = scrapy.Field()       # plot summary
pipelines.py code:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from .dao.taskdao import TaskDao


class NewvideomoviePipeline(object):
    def process_item(self, item, spider):
        # Store the list-page fields through the DAO
        s = TaskDao()
        s.create((item['filmName'], item['filmRanking'], item['filmType'], item['filmHref']))
        print('Pipeline output')
        print(item['filmName'])
        print(item['filmRanking'])
        print(item['filmType'])
        print(item['filmHref'])
        # Return the item so that any later pipeline still receives it
        return item
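The pipeline above builds a new TaskDao for every item, and the TaskDao source is not shown. As a rough sketch of one alternative, the DAO can be opened once per crawl through Scrapy's open_spider and close_spider pipeline hooks. SketchTaskDao and SketchPipeline are hypothetical names, the film_list table is an assumed schema, and sqlite3 again stands in for whatever database the project really uses.

```python
import sqlite3


class SketchTaskDao:
    """Hypothetical stand-in for the project's TaskDao (its source is not shown)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS film_list ("
            "film_name TEXT, film_ranking TEXT, film_type TEXT, film_href TEXT)"
        )

    def create(self, params):
        # Placeholders let the driver escape the values safely.
        cur = self.conn.execute(
            "INSERT INTO film_list (film_name, film_ranking, film_type, film_href) "
            "VALUES (?, ?, ?, ?)",
            params,
        )
        self.conn.commit()
        return cur.rowcount

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM film_list").fetchone()[0]

    def close(self):
        self.conn.close()


class SketchPipeline:
    """Pipeline variant that opens the DAO once per crawl."""

    def open_spider(self, spider):
        self.dao = SketchTaskDao()

    def close_spider(self, spider):
        self.dao.close()

    def process_item(self, item, spider):
        self.dao.create(
            (item["filmName"], item["filmRanking"], item["filmType"], item["filmHref"])
        )
        return item  # hand the item on to any later pipeline


# Exercise the pipeline by hand with a plain dict; the values are made up.
pipeline = SketchPipeline()
pipeline.open_spider(spider=None)
sample = {"filmName": "Example", "filmRanking": "8.0",
          "filmType": "Drama", "filmHref": "/example.html"}
returned = pipeline.process_item(sample, spider=None)
stored = pipeline.dao.count()
pipeline.close_spider(spider=None)
print(returned is sample, stored)
```

Whichever variant is used, Scrapy only calls a pipeline whose class path is listed in ITEM_PIPELINES in settings.py, as the template comment at the top of pipelines.py points out; for the original class that path would presumably be NewVideoMovie.pipelines.NewvideomoviePipeline.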
settings.py code:
# -*- coding: utf-8 -*-
# Scrapy settings for NewVideoMovie project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en