Scraping Meipai's data is not particularly hard; the tricky part is the obfuscation applied to its video URLs.
Open Meipai and view the page source. The page loads much like any other site's, and video_url sits right there in the source, but look closely: it is the string shown below. Anyone can guess it should be the video URL after some encryption or encoding. From experience I guessed base64, with the site's own scheme mixing some random characters into the string.
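To see why base64 is a plausible guess, here is what a plain base64-encoded URL looks like; the URL below is made up for illustration:

```python
import base64

url = 'http://example.com/video.mp4'  # hypothetical URL, not from the site
encoded = base64.b64encode(url.encode()).decode()
print(encoded)  # starts with "aHR0cDovL" -- the telltale base64 form of "http://"
print(base64.b64decode(encoded).decode())  # round-trips back to the original url
```

Strings on the page start out looking like this but do not decode directly, which is the hint that extra characters have been inserted.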
Here is the code I wrote; anyone interested can try it out:
items.py
import scrapy


class MeipaiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cut_url = scrapy.Field()
    create_time = scrapy.Field()
    video_url = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
spiders/meipai.py
# -*- coding: utf-8 -*-
import scrapy
import json
import base64
import logging

from Meipai.items import MeipaiItem


class MeipaiSpider(scrapy.Spider):
    name = 'meipai'
    allowed_domains = ['meipai.com']
    start_urls = ['http://www.meipai.com/']
    offset = 1
    MeiPai = [
        ('搞笑', '13'),
        ('明星', '16'),
        ('高颜值', '474'),
        ('舞蹈', '5872239354896137479'),
        ('精选', '488'),
        ('音乐', '5871155236525660080'),
        ('美食', '5870490265939297486'),
        ('时尚', '27'),
        ('美状', '6161763227134314911'),
        ('吃秀', '5871963671268989887'),
        ('宝宝', '5864549574576746574'),
        ('创意', '5875185672678760586'),
        ('游戏', '5879621667768487138'),
        ('体育', '5872639793429995335'),
        ('娱乐', '6204189999771523532'),
    ]

    def parse(self, response):
        for channel, id in self.MeiPai:
            JsonUrl = 'http://www.meipai.com/topics/hot_timeline?page=1&count=24&tid={}'.format(id)
            yield scrapy.Request(url=JsonUrl, callback=self.parse_item)

    def system(self, string_num):
        # Read the string as a hexadecimal number and return its decimal digits.
        return str(int(string_num.upper(), 16))

    def parse_item(self, response):
        item = MeipaiItem()
        OriginalHtml = json.loads(response.body.decode('utf-8'))
        NowHtml = OriginalHtml.get('medias')
        for NowData in NowHtml:
            CutPicture = NowData.get('cover_pic')
            item['cut_url'] = CutPicture
            CreateTime = NowData.get('created_at')
            item['create_time'] = CreateTime
            Title = NowData.get('caption')
            if Title:
                item['title'] = Title
            else:
                return
            User = NowData.get('user').get('screen_name')
            if User:
                item['author'] = User
            else:
                return
            try:
                EncryptionVideoUrl = NowData.get('video')
                # The first four characters, reversed and read as hex, give a
                # four-digit decimal number describing where the junk sits.
                Num = self.system(EncryptionVideoUrl[:4][::-1])
                StartNum = Num[0]    # length of the real prefix
                StartCount = Num[1]  # junk characters inserted after the prefix
                EndNum = Num[2]      # length of the real suffix
                EndCount = Num[3]    # junk characters inserted before the suffix
                AddendNum = -(int(EndCount) + int(EndNum))
                HeaderNUm = int(StartCount) + int(StartNum) + 4
                TrueMindPart = EncryptionVideoUrl[HeaderNUm:AddendNum]
                TrueStartUrl = EncryptionVideoUrl[4:4 + int(StartNum)]
                TrueEndtUrl = EncryptionVideoUrl[-int(EndNum):]
                # Reassembled without the junk, the string is plain base64.
                DealWithFinalUrl = TrueStartUrl + TrueMindPart + TrueEndtUrl
                FinalMp4UrlData = str(base64.b64decode(DealWithFinalUrl), 'utf-8')
                item['video_url'] = FinalMp4UrlData
            except Exception as e:
                logging.info(e)
                return
            if not item['video_url']:
                return
            yield item
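The slicing in parse_item is easier to follow outside the spider. Here is the same de-obfuscation as a standalone sketch; the layout used to exercise it below is synthetic, since real strings come from the `video` field:

```python
import base64

def decode_video_url(s):
    # The first four characters, reversed and read as hex, give a four-digit
    # decimal number: digit 0 = length of the real prefix, digit 1 = junk
    # characters after it, digit 2 = length of the real suffix, digit 3 =
    # junk characters before it.  Removing the junk leaves plain base64.
    num = str(int(s[:4][::-1].upper(), 16))
    start_len, start_junk = int(num[0]), int(num[1])
    end_len, end_junk = int(num[2]), int(num[3])
    head = s[4:4 + start_len]
    middle = s[4 + start_len + start_junk:-(end_len + end_junk)]
    tail = s[-end_len:]
    return base64.b64decode(head + middle + tail).decode('utf-8')
```

Given a string built with, say, a 2-character real prefix, 3 junk characters, a 4-character real suffix and 1 junk character before it, the header would be the digits 2341 encoded as hex (0x0925) and reversed, i.e. "5290".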
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import logging


class MeipaiPipeline(object):
    def __init__(self):
        self.name = open('data.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese titles readable in the output file,
        # without the encode/decode('unicode-escape') round-trip.
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        logging.info(type(content))
        self.name.write(content)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not close_item) when the spider finishes.
        self.name.close()
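As the comment at the top of the file says, the pipeline only runs once it is registered. A sketch of the settings.py entry, assuming the project package is named Meipai as in the spider's imports:

```python
# settings.py -- register the pipeline so Scrapy actually calls it
ITEM_PIPELINES = {
    'Meipai.pipelines.MeipaiPipeline': 300,  # 300 is an arbitrary priority
}
```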
Check the log output: the video URL is now in the familiar MP4 format, and it can be opened and requested in a browser.
Pick any one of them and paste it into a browser; it works fine.
As for handling the data, you can save it locally or store it in a database. Interested readers can also download the videos themselves; here I simply saved the data to a local file.
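For anyone who does want the videos themselves, a minimal download sketch using only the standard library; the URL and filename are placeholders to be replaced with a decoded video_url:

```python
import shutil
import urllib.request

def download_video(url, path):
    # Stream the response straight to disk so a large MP4 is never
    # held in memory all at once.
    with urllib.request.urlopen(url, timeout=30) as resp, open(path, 'wb') as f:
        shutil.copyfileobj(resp, f)
```

Calling download_video(item['video_url'], 'clip.mp4') would then fetch one clip.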