Building a Video Streaming Website from Scratch (1): Data Collection

1. Database design:

source table (playback sources):

  • id: auto-increment id
  • name: name of the playback source
  • desc: description of the playback source
  • parse: parser URL
  • show: display name


video table:

  • id: video id
  • title: video title
  • desc: video description
  • type: category (1: movie, 2: TV series, 3: variety show, 4: anime)
  • area: region
  • thumb: thumbnail
  • year: year
  • director: director
  • introduction: synopsis
  • actors: cast


play table:

  • id: auto-increment id
  • video_id: video id (joins to video.id)
  • from: playback source name (joins to source.name)
  • url: playback URL
  • part: episode number

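A minimal DDL sketch of the three tables, run once with MySQLdb (the same driver the spider uses later). The column types, the database name "video_site" and the connection credentials below are my assumptions, not taken from the article:

import MySQLdb

DDL = [
    """CREATE TABLE IF NOT EXISTS source (
        id     INT AUTO_INCREMENT PRIMARY KEY,
        name   VARCHAR(64),
        `desc` VARCHAR(255),
        parse  VARCHAR(255),
        `show` VARCHAR(64)
    ) DEFAULT CHARSET=utf8""",
    """CREATE TABLE IF NOT EXISTS video (
        id           INT PRIMARY KEY,  -- id comes from the source site, so no auto-increment
        title        VARCHAR(255),
        `desc`       VARCHAR(255),
        type         TINYINT,          -- 1: movie, 2: TV series, 3: variety show, 4: anime
        area         VARCHAR(64),
        thumb        VARCHAR(255),
        year         VARCHAR(16),
        director     VARCHAR(255),
        introduction TEXT,
        actors       VARCHAR(255)
    ) DEFAULT CHARSET=utf8""",
    """CREATE TABLE IF NOT EXISTS play (
        id       INT AUTO_INCREMENT PRIMARY KEY,
        video_id INT,                  -- joins to video.id
        `from`   VARCHAR(64),          -- joins to source.name
        url      TEXT,
        part     VARCHAR(32)
    ) DEFAULT CHARSET=utf8""",
]

# placeholder credentials - point these at your own MySQL instance
db = MySQLdb.connect(host="127.0.0.1", user="root", passwd="root", db="video_site", charset="utf8")
cursor = db.cursor()
for statement in DDL:
    cursor.execute(statement)
db.commit()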

2. Environment:

  • Python 3.6.6
  • Modules used (a pip one-liner follows after this list):
  1. scrapy: for writing the crawler
  2. scrapy-redis: combined with scrapy to implement incremental crawling
  3. requests: for sending POST/GET requests
  • Windows 7
  • MySQL database: stores the scraped data
  • Redis database
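
All of the modules can be installed with pip. A hedged one-liner (on Python 3 the MySQLdb driver used by the spider below is usually provided by the mysqlclient package; pinning exact versions is up to you):

pip install scrapy scrapy-redis requests mysqlclient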

3. Create the project:

scrapy startproject project_name
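
For reference, the skeleton Scrapy generates looks roughly like this (file names come from Scrapy's default project template):

project_name/
    scrapy.cfg
    project_name/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py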

4. settings.py

LOG_LEVEL = "WARNING"
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Scrapy_www_vultr1_com (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
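
One more setting worth noting: for the pipeline in section 5 to run, it also has to be registered here. A sketch, assuming the project is called project_name and the pipeline keeps Scrapy's default class name (both names are placeholders):

ITEM_PIPELINES = {
    "project_name.pipelines.ProjectNamePipeline": 300,
}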


5. Writing the spider:

Initialization (connecting to the database):

    def __init__(self):
        # Connection parameters are read from the spider's MYSQL_* attributes
        params = {
            "host": self.MYSQL_HOST,
            "port": self.MYSQL_PORT,
            "user": self.MYSQL_USER,
            "passwd": self.MYSQL_PWD,
            "db": self.MYSQL_DB,
            "use_unicode": True,
            "charset": "utf8",
        }
        self.db = MySQLdb.connect(**params)
        #self.playerconfig()
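
The MYSQL_* attributes referenced above are not shown elsewhere in the article; a minimal sketch that defines them as class attributes on the spider (the names match the snippet above, the values are placeholders):

    # placeholder connection settings - point these at your own MySQL instance
    MYSQL_HOST = "127.0.0.1"
    MYSQL_PORT = 3306
    MYSQL_USER = "root"
    MYSQL_PWD = "root"
    MYSQL_DB = "video_site"

The spider module also needs import MySQLdb at the top, alongside the import re, import json and import scrapy used by the methods below.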

Handling responses:

    def parse(self, response):
        # Category links in the top navigation (skip the first and last entries)
        href_list = response.xpath("//ul[@class='stui-header__menu type-slide']/li/a/@href")[1:-1]
        for href in href_list:
            # The numeric id in the URL is the category (1: movie, 2: TV series, 3: variety show, 4: anime)
            _type = re.search(r"v/(\d*).html", href.get()).group(1)
            yield scrapy.Request(self.domain + href.get(), callback=self.parse_item, meta={"type": _type})

    # Handle category listing pages
    def parse_item(self, response):
        a_list = response.xpath("//a[@class='stui-vodlist__thumb lazyload']/@href")
        for a in a_list:
            yield scrapy.Request(
                    self.domain + a.extract(),
                    callback=self.parse_detail,
                    meta={"type": response.meta["type"]}
                )
            #print(self.domain + a.extract() + " -- queued")

        # Pagination: keep following the "next page" link until the active page is the last one
        next_url = response.xpath("//ul[@class='stui-page text-center clearfix']/li/a[text()='下1页']/@href").extract_first(default="")
        last_url = response.xpath("//ul[@class='stui-page text-center clearfix']/li/a[text()='尾页']/@href").extract_first(default="")
        active_url = response.xpath("//ul[@class='stui-page text-center clearfix']/li[@class='hidden-xs active']/a/@href").extract_first(default="")
        if next_url and active_url != last_url:
            print("next_url: " + self.domain + next_url + " -- queued")
            yield scrapy.Request(self.domain + next_url, callback=self.parse_item, meta={"type": response.meta["type"]})
        else:
            print("All pages crawled.")

    # Handle the video detail page
    def parse_detail(self, response):
        item = {}
        item["table"] = "video"
        item["id"] = re.search(r"video/(\d*).html", response.request.url).group(1)
        item['thumb'] = response.xpath("//img[@class='lazyload']/@src").get(default="")
        item['title'] = response.xpath("//h1[@class='title']/text()").extract_first(default="")
        #item['type'] = response.xpath("//p[@class='data ']/a/text()").extract_first()
        item["type"] = response.meta["type"]
        item['area'] = response.xpath("//p[@class='data ']/text()[4]").extract_first(default="").replace('\t', '')
        item['year'] = response.xpath("//p[@class='data ']/text()[6]").extract_first(default="").replace('\t', '').replace('\r\n', '')
        item['director'] = response.xpath("//p[@class='data'][2]/a/text()").get(default="")
        # The synopsis is split into a short sketch and the full description
        item['introduction'] = ""
        sketch = response.xpath("//span[@class='detail-sketch']/text()")
        content = response.xpath("//span[@class='detail-content']/text()")
        if sketch:
            item['introduction'] += sketch.get(default="")
        if content:
            item['introduction'] += content.get(default="")

        item['actors'] = ""
        for s in response.xpath("//p[@class='data'][1]/a/text()"):
            item['actors'] += s.extract() + " "
        yield item

        # (rest omitted)

    # Handle the video playback page
    def parse_play(self, response):
        item = {}
        item["part"] = response.meta["part"]
        item["video_id"] = response.meta["video_id"]
        item["table"] = "play"
        # The player config is embedded in the page as a JSON blob: "player_data={...}</script>"
        player_data = re.search(r"player_data=(.*?)</script>", response.text).group(1)
        player_data = json.loads(player_data)
        item["from"] = player_data["from"]
        item["url"] = player_data["url"]
        yield item

pipelines.py (data processing: saving to the database):

    def process_item(self, item, spider):
        # Build an "INSERT INTO <table> SET col=value, ..." statement from the item
        table = item.pop("table")
        sql = "INSERT INTO " + table + " SET "
        for k, v in item.items():
            if k in ("id", "video_id", "part", "type"):
                # numeric columns: no quotes around the value
                sql += "`" + k + "`=" + v + ","
            else:
                # string columns: quote the value (note: it is not escaped here)
                sql += "`" + k + "`=\"" + v + "\","
        sql = sql[:-1]  # drop the trailing comma
        cursor = spider.db.cursor()
        cursor.execute(sql)
        spider.db.commit()
        return item
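
Because the concatenation above does not escape values, a title containing a double quote would break the INSERT. A hedged alternative with the same behaviour that lets MySQLdb escape the values instead:

    def process_item(self, item, spider):
        table = item.pop("table")
        columns = list(item.keys())
        # Parameterized query: the driver quotes and escapes every value
        sql = "INSERT INTO " + table + " SET " + ", ".join("`%s`=%%s" % c for c in columns)
        cursor = spider.db.cursor()
        cursor.execute(sql, [item[c] for c in columns])
        spider.db.commit()
        return item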

6. Incremental crawling:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True

REDIS_URL = "redis://127.0.0.1:6379"

  • Have the spider inherit from RedisSpider (a minimal sketch follows below)
  • Push the start URL into Redis:

LPUSH spider_name:start_urls start_url
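
A minimal sketch of the spider declaration under scrapy-redis; RedisSpider and redis_key are part of the scrapy_redis API, while the spider name and key below are placeholders matching the LPUSH command above:

from scrapy_redis.spiders import RedisSpider

class VideoSpider(RedisSpider):
    name = "spider_name"                  # placeholder, used by `scrapy crawl spider_name`
    redis_key = "spider_name:start_urls"  # start URLs are read from this Redis list
    domain = "http://example.com"         # placeholder site root used when building absolute URLs

    def parse(self, response):
        # same parsing logic as in section 5; start_urls is no longer defined on the spider
        ...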

7. Start the crawler:

scrapy crawl spider_name

8. Summary:

  • Difference between re.match and re.search: match only matches at the beginning of the string, while search scans the whole string (a small example follows after the Scrapy source excerpt below)
  • response.xpath returns a SelectorList, which is a list of Selector objects. The relevant Scrapy source:
class SelectorList(list):
    def getall(self):
        """
        Call the ``.get()`` method for each element is this list and return
        their results flattened, as a list of unicode strings.
        """
        return [x.get() for x in self]
    extract = getall

    def get(self, default=None):
        """
        Return the result of ``.get()`` for the first element in this list.
        If the list is empty, return the default value.
        """
        for x in self:
            return x.get()
        else:
            return default
    extract_first = get


class Selector(object):
    def get(self):
        """
        Serialize and return the matched nodes in a single unicode string.
        Percent encoded content is unquoted.
        """
        try:
            return etree.tostring(self.root,
                                  method=self._tostring_method,
                                  encoding='unicode',
                                  with_tail=False)
        except (AttributeError, TypeError):
            if self.root is True:
                return u'1'
            elif self.root is False:
                return u'0'
            else:
                return six.text_type(self.root)
    extract = get

    def getall(self):
        """
        Serialize and return the matched node in a 1-element list of unicode strings.
        """
        return [self.get()]
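
A quick illustration of the first bullet, using the same pattern as parse_detail:

import re

s = "play video/123.html now"
print(re.match(r"video/(\d*).html", s))             # None: match only succeeds at the start of the string
print(re.search(r"video/(\d*).html", s).group(1))   # '123': search scans the whole string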