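Scraping detailed info on Douban's top 100 movies with Selenium and headless Chrome
PhantomJS is no longer maintained, so headless Chrome is a good replacement. Note that headless mode requires Chrome 59 or later on Mac and Linux, and Chrome 60 or later on Windows.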
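The version requirement can be checked before launching the browser. The sketch below is only an illustration and not part of the original script: `supports_headless` is a hypothetical helper, and it assumes the version string looks like the output of `chrome --version` (for example `Google Chrome 72.0.3626.109`).

```python
import platform
import re


def supports_headless(version_string, system=None):
    """Return True if this Chrome version meets the headless minimum.

    Hypothetical helper: the minimum is 60 on Windows and 59 on Mac/Linux.
    """
    system = system or platform.system()
    # Take the first number that is followed by a dot as the major version
    match = re.search(r"(\d+)\.", version_string)
    if match is None:
        raise ValueError(f"no version number found in {version_string!r}")
    major = int(match.group(1))
    minimum = 60 if system == "Windows" else 59
    return major >= minimum
```

Passing `system` explicitly makes the check easy to try for a platform other than the one you are on; when it is omitted, `platform.system()` is used.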
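Douban loads its movie details via Ajax, so you cannot extract them directly from the page source. There are two ways to get at the data:
- Visit the URL, use Chrome's DevTools network panel to find the real URL that serves the data (under XHR), work out the URL pattern, and fetch everything
- Use selenium + headless chrome (not recommended)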
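For reference, the first approach could look roughly like the sketch below. Everything specific in it is an assumption rather than something taken from this post: the endpoint path in `XHR_ENDPOINT`, the `start`/`limit` paging parameters, and the helper name `build_page_urls` are all illustrative, so confirm the real values in the XHR tab before relying on them.

```python
from urllib.parse import urlencode

# Assumption: this endpoint path and the start/limit paging parameters are
# illustrative only; look up the real ones in the DevTools XHR tab.
XHR_ENDPOINT = "https://movie.douban.com/j/chart/top_list"


def build_page_urls(total, page_size=20, type_id=13, interval_id="100:90"):
    """Build one XHR URL per page of results (hypothetical helper)."""
    urls = []
    for start in range(0, total, page_size):
        query = urlencode({
            "type": type_id,
            "interval_id": interval_id,
            "action": "",
            "start": start,        # offset of the first movie on this page
            "limit": page_size,    # number of movies per request
        })
        urls.append(f"{XHR_ENDPOINT}?{query}")
    return urls
```

Each URL would then be fetched with an HTTP client and the response parsed, which avoids running a browser at all.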
Today we'll cover the second method (for the first one, contact me on QQ; the number is at the bottom).
Tools we'll use
- selenium
- headless chrome
- xpath (BeautifulSoup or re would work too)
- mongodb
How it works
1. Import the packages and create the database
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree
import pymongo
import time

client = pymongo.MongoClient('localhost', 27017)
# Database name: doubanaa
doubanaq = client['doubanaa']
# Collection name: contentone
content = doubanaq['contentone']
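2. Create the browser object
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# path points to your ChromeDriver executable (it must go all the way to the .exe file)
path = r'D:\软件\chrome\chrome驱动\chromedriver_win32\chromedriver.exe'
# Selenium 3 style call; Selenium 4 expects service=Service(path) and options=chrome_options instead
browser = webdriver.Chrome(executable_path=path, chrome_options=chrome_options)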
3. Fetch the page source
url = 'https://movie.douban.com/typerank?type_name=%E7%88%B1%E6%83%85&type=13&interval_id=100:90&action='
browser.get(url)
# Wait 3 seconds for the page to load
time.sleep(3)
# Print the page source that was fetched
print(browser.page_source)
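The query string of this URL decides which ranking gets loaded, so it helps to decode it before changing anything. The sketch below uses only the standard library. It shows that `type_name` is the percent-encoded word 爱情 (romance); reading `type` as a genre ID and `interval_id` as a ranking range is my guess, not something the page documents.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://movie.douban.com/typerank'
       '?type_name=%E7%88%B1%E6%83%85&type=13&interval_id=100:90&action=')

# keep_blank_values=True keeps the empty `action` parameter in the result
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
for name, values in params.items():
    print(name, '=', values[0])
```

To scrape a different ranking, you would presumably swap in the matching `type_name` and `type` values from the URL Douban shows for that ranking.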
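4. Scroll down to load all the movies
# I only load 10 times here; you can load more
for i in range(10):
    # Scroll to the current bottom of the page
    js = "window.scrollTo(0, document.documentElement.scrollHeight)"
    # Run the script (via the browser's execute_script method)
    browser.execute_script(js)
    time.sleep(3)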
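The loop above scrolls a fixed number of times, which either wastes time or stops too early. One possible alternative, sketched here under the assumption that the page height stops growing once everything is loaded, is to compare the height before and after each scroll. `scroll_until_stable` is a hypothetical helper name; it only relies on Selenium's `execute_script` method.

```python
import time


def scroll_until_stable(browser, pause=3, max_rounds=300):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed. `browser` is any object with a
    Selenium-style execute_script(js) method.
    """
    height_js = "return document.documentElement.scrollHeight"
    scroll_js = "window.scrollTo(0, document.documentElement.scrollHeight)"
    last_height = browser.execute_script(height_js)
    for rounds in range(1, max_rounds + 1):
        browser.execute_script(scroll_js)
        time.sleep(pause)  # give the Ajax request time to finish
        new_height = browser.execute_script(height_js)
        if new_height == last_height:
            return rounds  # nothing new was loaded by this scroll
        last_height = new_height
    return max_rounds
```

Calling `scroll_until_stable(browser)` would replace the fixed loop. A slow network can make the height look stable too early, so `pause` may need to be raised.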
5. Parse with XPath and store the results in MongoDB (very simple)
tree = etree.HTML(browser.page_source)
res = tree.xpath('//div[@class = "movie-info"]')
for movie in res:
    title = movie.xpath('.//div/span/a/text()')[0]
    # The rank is normally in span[3]; fall back to span[2] when span[3] is missing
    try:
        number = movie.xpath('.//div[@class = "movie-name"]/span[3]/text()')[0]
    except IndexError:
        number = movie.xpath('.//div[@class = "movie-name"]/span[2]/text()')[0]
    actor = movie.xpath('.//div[@class = "movie-crew"]/text()')[0]
    types = movie.xpath('.//div[@class = "movie-misc"]/text()')[0]
    score = movie.xpath('.//div[@class = "movie-rating"]/span[2]/text()')[0]
    item = {
        '电影名字': title,    # movie title
        '电影排名': number,   # rank
        '演员': actor,        # cast
        '电影类型': types,    # genre
        '评分': score,        # rating
    }
    # Insert into the database
    content.insert_one(item)
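The span[3]/span[2] fallback is easier to see on a small example. The sketch below is not the scraper itself. It uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra packages, and the HTML fragment, including the span class names, is made up to mimic the structure the XPath expressions above expect. `extract_rank` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that mimics the class names used above; real Douban markup
# may differ, and real-world HTML usually needs lxml's forgiving parser.
SAMPLE = """
<div class="movie-list">
  <div class="movie-info">
    <div class="movie-name">
      <span class="movie-name-text"><a>Movie A</a></span>
      <span class="playable">playable</span>
      <span class="rank-num">1</span>
    </div>
  </div>
  <div class="movie-info">
    <div class="movie-name">
      <span class="movie-name-text"><a>Movie B</a></span>
      <span class="rank-num">2</span>
    </div>
  </div>
</div>
"""


def extract_rank(movie):
    """Return the rank from the third span when present, else from the second."""
    spans = movie.findall("./div[@class='movie-name']/span")
    # List indexes are 0-based, so XPath span[3] is spans[2]
    index = 2 if len(spans) >= 3 else 1
    return spans[index].text


root = ET.fromstring(SAMPLE)
for movie in root.findall(".//div[@class='movie-info']"):
    title = movie.find("./div/span/a").text
    print(title, extract_rank(movie))
```

Checking the number of spans up front gives the same result as the try/except in the scraper, without relying on an exception for the common case.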
Full code (I'm not a fan of splitting things into functions...)
# @Time : 2019/2/20 20:35
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from lxml import etree
import pymongo
import time


def main():
    client = pymongo.MongoClient('localhost', 27017)
    doubanaq = client['doubanaa']
    content = doubanaq['contentone']
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    path = r'D:\软件\chrome\chrome驱动\chromedriver_win32\chromedriver.exe'
    # Create the browser object (Selenium 3 style call; Selenium 4 expects
    # service=Service(path) and options=chrome_options instead)
    browser = webdriver.Chrome(executable_path=path, chrome_options=chrome_options)
    url = 'https://movie.douban.com/typerank?type_name=%E7%88%B1%E6%83%85&type=13&interval_id=100:90&action='
    browser.get(url)
    time.sleep(3)
    # print(browser.page_source)
    # Scroll down repeatedly to load more movies
    for i in range(300):
        js = "window.scrollTo(0, document.documentElement.scrollHeight)"
        browser.execute_script(js)
        time.sleep(3)
    time.sleep(5)
    browser.save_screenshot('db.png')
    tree = etree.HTML(browser.page_source)
    res = tree.xpath('//div[@class = "movie-info"]')
    for movie in res:
        title = movie.xpath('.//div/span/a/text()')[0]
        # The rank is normally in span[3]; fall back to span[2] when span[3] is missing
        try:
            number = movie.xpath('.//div[@class = "movie-name"]/span[3]/text()')[0]
        except IndexError:
            number = movie.xpath('.//div[@class = "movie-name"]/span[2]/text()')[0]
        actor = movie.xpath('.//div[@class = "movie-crew"]/text()')[0]
        types = movie.xpath('.//div[@class = "movie-misc"]/text()')[0]
        score = movie.xpath('.//div[@class = "movie-rating"]/span[2]/text()')[0]
        item = {
            '电影名字': title,    # movie title
            '电影排名': number,   # rank
            '演员': actor,        # cast
            '电影类型': types,    # genre
            '评分': score,        # rating
        }
        content.insert_one(item)
        # print(title, number, actor, types, score)


if __name__ == '__main__':
    main()
Result
If anything is unclear, contact me on QQ: 1428090104
Life is short, I use Python