影院信息在各渠道都是按城市分类,通过城市id分页查询,因此在抓取影院前先抓取城市信息,城市信息基本不变,只需抓取一次即可。
猫眼
没有从猫眼找到直接获取城市列表的API, 所以基于selenium模拟chrome访问猫眼首页,获取首页html源码,然后从html中提取所有城市。
city_mt源码:https://github.com/improvejin/hyspider/tree/master/hyspider/spiders/city/mt.py
#存储猫眼城市信息
CREATE TABLE `city_lm` (
`id` int(11) unsigned NOT NULL,
`g` char(1) NOT NULL DEFAULT '',
`name` varchar(10) NOT NULL DEFAULT '',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
淘票票
同猫眼一样,淘票票也是基于selenium获取首页html源码解析出城市信息。
city_tb源码:https://github.com/improvejin/hyspider/tree/master/hyspider/spiders/city/tb.py
#存储淘宝城市信息
CREATE TABLE `city_tb` (
`id` int(11) unsigned NOT NULL,
`g` char(1) NOT NULL DEFAULT '' commnet '城市分组',
`name` varchar(10) NOT NULL DEFAULT '',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
糯米
糯米则比较简单,直接有获取城市列表的API,但是使用scrapy抓取时需要变换UserAgent,糯米会屏蔽掉基于scrapy agent的抓取。
city_lm源码:https://github.com/improvejin/hyspider/tree/master/hyspider/spiders/city/lm.py
#糯米影院城市api
http://dianying.baidu.com/common/city/citylist?hasLetter=false&isjson=true
#存储糯米城市信息
CREATE TABLE `city_lm` (
`id` int(11) unsigned NOT NULL,
`g` char(1) NOT NULL DEFAULT '' commnet '城市分组',
`name` varchar(10) NOT NULL DEFAULT '',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
城市匹配
城市匹配比较简单,直接基于name匹配即可。
#存储匹配后的城市信息
CREATE TABLE `city` (
`id_mt` int(11) NOT NULL,
`id_tb` int(11) NOT NULL,
`id_lm` int(11) NOT NULL,
`g` char(255) NOT NULL,
`name` varchar(10) CHARACTER SET utf8 NOT NULL,
`fly` bit(1) NOT NULL DEFAULT b'0',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`update_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id_mt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
#以city_mt为标准,基于name字段join
replace city(id_mt, id_tb, id_lm, g, name)
select city_mt.id, city_tb.id, city_lm.id, city_mt.g, city_mt.name from city_mt join city_tb on city_mt.name=city_tb.name
join city_lm on city_mt.name=city_lm.name;