价格是比价中最重要地信息,价格信息量大,并且是不断变化的,尽可能准确快速地抓取各个渠道电影及优惠信息是比价软件的核心竞争力。
爬取时分城市爬取,先从DB中加载城市cinema,然后通过cinema_id直接获取影院所有价格信息,过程类似我们平时直接进入影院查看价格的行为。
猫眼
猫眼价格h5页面需借助selenium模拟chrome进行爬取,爬取速度比价慢。观察发现web端价格信息与h5/app价格一样,并且不用借助selenium,能达到快速爬取的效果。
借助selenium爬取时,价格信息通过ajax加载,借助BrowserMiddleware提供的相关函数可等待期待的内容加载完毕后再借助PriceMiddleware解析价格信息。
H5抓取源码:https://github.com/improvejin/hyspider/tree/master/hyspider/spiders/price/mt.py
MTPriceMiddleware源码:https://github.com/improvejin/hyspider/tree/master/hyspider/middlewares/price/mt.py
Web抓取源码:https://github.com/improvejin/hyspider/tree/master/hyspider/matchers/cinema.py
MT2PriceMiddleware源码:https://github.com/improvejin/hyspider/tree/master/hyspider/middlewares/price/mt2.py
BrowserMiddleware源码:https://github.com/improvejin/hyspider/tree/master/hyspider/middlewares/browser.py
猫眼价格信息采用了woff字体加密防爬,并且woff基础库随请求随机变化,因此在解析woff price之前必须先下载与当前响应匹配的woff基础库文件,然后才能解析出price信息,具体解密过程参考猫眼价格解密。
woff价格解密源码:https://github.com/improvejin/hyspider/tree/master/hyspider/utils/font_util.py#MTFontParser
#猫眼价格信息
CREATE TABLE `price_mt` (
`cinema_id` int(11) unsigned NOT NULL,
`movie_id` int(11) unsigned NOT NULL,
`show_date` date NOT NULL comment '放映日期',
`begin` varchar(10) NOT NULL comment '放映开始时间',
`end` varchar(10) NOT NULL DEFAULT '' comment '影片结束时间',
`language` varchar(20) NOT NULL DEFAULT '' comment '影片语言',
`hall` varchar(50) NOT NULL DEFAULT '' comment '放映厅',
`price` float NOT NULL DEFAULT '0' comment '价格',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`cinema_id`,`movie_id`,`show_date`,`begin`),
KEY `idx_cinema_id` (`cinema_id`) USING BTREE,
KEY `idx_movie_id` (`movie_id`) USING BTREE,
KEY `idx_show_date` (`show_date`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
淘票票
淘票票同猫眼一样,H5需借助Selenium,难爬取,可直接爬取web平台价格信息,只是淘票票平台优惠信息有时只有手机平台才有。淘票票价格信息没有加密。
H5抓取源码:https://github.com/improvejin/hyspider/tree/master/hyspider/spiders/price/tb.py
TBPriceMiddleware源码:https://github.com/improvejin/hyspider/tree/master/hyspider/middlewares/price/tb.py
Web抓取源码:https://github.com/improvejin/hyspider/tree/master/hyspider/spiders/price/tb2.py
TB2PriceMiddleware源码:https://github.com/improvejin/hyspider/tree/master/hyspider/middlewares/price/tb2.py
#淘票票价格信息
CREATE TABLE `price_tb` (
`cinema_id` int(11) NOT NULL,
`movie_id` int(11) NOT NULL,
`show_date` date NOT NULL,
`begin` varchar(10) NOT NULL,
`end` varchar(10) NOT NULL DEFAULT '',
`language` varchar(20) NOT NULL DEFAULT '',
`hall` varchar(50) NOT NULL DEFAULT '',
`price` float NOT NULL DEFAULT '0',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`cinema_id`,`movie_id`,`show_date`,`begin`),
KEY `idx_cinema_id` (`cinema_id`) USING BTREE,
KEY `idx_movie_id` (`movie_id`) USING BTREE,
KEY `idx_show_date` (`show_date`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
糯米
糯米价格最好抓取,直接通过h5页面就可获取某个影院所有价格信息,同样没有任何加密。
H5抓取源码:https://github.com/improvejin/hyspider/tree/master/hyspider/spiders/price/lm.py
#city=289, cinemaId=66影院价格信息
https://mdianying.baidu.com/cinema/detail?cinemaId=66&from=webapp&sub_channel=wise_shoubai_movieScheduleWeb&c=289
#糯米价格信息
CREATE TABLE `price_lm` (
`cinema_id` int(11) unsigned NOT NULL,
`movie_id` int(11) unsigned NOT NULL,
`show_date` date NOT NULL,
`begin` varchar(10) NOT NULL,
`end` varchar(10) NOT NULL DEFAULT '',
`language` varchar(20) NOT NULL DEFAULT '',
`hall` varchar(50) NOT NULL DEFAULT '',
`price` float NOT NULL DEFAULT '0',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`cinema_id`,`movie_id`,`show_date`,`begin`),
KEY `idx_cinema_id` (`cinema_id`) USING BTREE,
KEY `idx_movie_id` (`movie_id`) USING BTREE,
KEY `idx_show_date` (`show_date`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
匹配
有了电影和影院匹配表,价格匹配时将渠道价格表与电影影院表按放映日期和放映开始时间full join起来就得到了所有影院所有电影在各个渠道的价格信息。
汇总源码:PriceManager#save_all_channels_price
#汇总了三个渠道的价格信息
CREATE TABLE `price` (
`cinema_id` int(11) NOT NULL,
`movie_id` int(11) NOT NULL,
`show_date` date NOT NULL,
`begin` varchar(10) NOT NULL,
`end` varchar(10) NOT NULL DEFAULT '',
`language` varchar(20) NOT NULL DEFAULT '',
`hall` varchar(50) NOT NULL DEFAULT '',
`price_mt` float NOT NULL DEFAULT '0',
`price_tb` float NOT NULL DEFAULT '0',
`price_lm` float NOT NULL DEFAULT '0',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`cinema_id`,`movie_id`,`show_date`,`begin`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
# 语义上同三个渠道价格信息full join,但mysql不支持full join,需要用left join union right join替换full join
replace into price(cinema_id, movie_id, show_date, begin, end, language, hall, price_mt, price_tb, price_lm)
select
COALESCE(mt_price_tmp.cinema_id, tb_price_tmp.cinema_id, lm_price_tmp.cinema_id) as cinema_id,
COALESCE(mt_price_tmp.movie_id, tb_price_tmp.movie_id, lm_price_tmp.movie_id) as movie_id,
COALESCE(mt_price_tmp.show_date, tb_price_tmp.show_date, lm_price_tmp.show_date) as show_date,
COALESCE(mt_price_tmp.begin, tb_price_tmp.begin, lm_price_tmp.begin) as begin,
COALESCE(mt_price_tmp.end, tb_price_tmp.end, lm_price_tmp.end) as end,
COALESCE(mt_price_tmp.language, tb_price_tmp.language, lm_price_tmp.language) as language,
COALESCE(mt_price_tmp.hall, tb_price_tmp.hall, lm_price_tmp.hall) as hall,
COALESCE(mt_price_tmp.price, 0) as price_mt,
COALESCE(tb_price_tmp.price, 0) as price_tb,
COALESCE(lm_price_tmp.price, 0) as price_lm
from
(
select cinema.id_mt as cinema_id, movie.id_db as movie_id, show_date, begin, end, language, hall, price
from cinema join price_mt on ( cinema.id_mt=price_mt.cinema_id) join movie on (movie.id_mt=price_mt.movie_id)
) mt_price_tmp
full outer join
(
select cinema.id_mt as cinema_id, movie.id_db as movie_id, show_date, begin, end, language, hall, price
from cinema join price_tb on ( cinema.id_tb=price_tb.cinema_id) join movie on (movie.id_tb=price_tb.movie_id)
) tb_price_tmp on (mt_price_tmp.cinema_id = tb_price_tmp.cinema_id and mt_price_tmp.movie_id = tb_price_tmp.movie_id
and mt_price_tmp.show_date = tb_price_tmp.show_date and mt_price_tmp.begin = tb_price_tmp.begin)
full outer join
(
select cinema.id_mt as cinema_id, movie.id_db as movie_id, show_date, begin, end, language, hall, price
from cinema join price_lm on ( cinema.id_lm=price_lm.cinema_id) join movie on (movie.id_lm=price_lm.movie_id)
) lm_price_tmp on (COALESCE(mt_price_tmp.cinema_id, tb_price_tmp.cinema_id) = lm_price_tmp.cinema_id and COALESCE(mt_price_tmp.movie_id, tb_price_tmp.movie_id) = lm_price_tmp.movie_id
and COALESCE(mt_price_tmp.show_date, tb_price_tmp.show_date) = lm_price_tmp.show_date and COALESCE(mt_price_tmp.begin, tb_price_tmp.begin) = lm_price_tmp.begin)
代理
价格信息数据量大,频繁变动,每天都要爬取,需要借助代理防止被屏蔽爬取。
使用ProxyMiddleware从http://123.207.35.36:5010/get/获取免费代理,替换掉scrapy请求时的IP。
ProxyMiddleware源码:https://github.com/improvejin/hyspider/tree/master/hyspider/middlewares/ipproxy.py
http://123.207.35.36:5010/get/提供的代理是从各个代理服务提供方抓取的免费代理,详情参考https://github.com/jhao104/proxy_pool
http://icanhazip.com/可返回请求者IP