Crawl targets
Movie data to crawl: name, nameFrn, year, cover, runtime, types, releaseDate, rating, rateNum, directors, writers, country, summary, stars
Movie review (comments) data to crawl: user, userId, rating, content, time
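Crawling dynamic pages
Ratings and similar information on the page are generated dynamically via AJAX, so they cannot be scraped directly from the page source.
selenium together with chromedriver can simulate a visit from the Chrome browser, which makes it possible to scrape data from dynamically generated pages.
Use Options to configure the webdriver to run headless Chrome: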
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)
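Because the ratings are filled in by AJAX after the page loads, an element may be missing or empty when it is first queried. Below is a minimal sketch of a polling helper that retries a lookup until it returns a non-empty value. The name `poll_until` and its parameters are illustrative and not part of the original script; selenium also ships its own `WebDriverWait` class for the same purpose.

```python
import time


def poll_until(fetch, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until it returns a non-empty value.

    fetch is any zero-argument callable, for example a lambda that
    reads the text of an element from the driver (hypothetical usage).
    Raises TimeoutError if nothing non-empty arrives within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = fetch()
            if value:
                return value
        except Exception:
            # The element may not exist yet; treat this as "not ready" and retry.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("value not available after {} s".format(timeout))
        time.sleep(interval)
```

With a live driver, the callable would wrap the element lookup, for instance a lambda that returns the `.text` of the rating element. The XPath for that element depends on the target page and is not given here.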
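Crawling process
Connect to the MongoDB database and read the target URLs from profile.
Locate page elements with XPath and read their text, src, and other data.
Finally, after some light processing, store the movie data in details and the review data in comments.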
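The text does not spell out what the light processing involves. As a rough sketch only, the helpers below turn raw strings into typed fields before they are stored. The function names, the raw string formats (such as "148 min" or "Action / Sci-Fi"), and the choice of fields are all assumptions made for illustration.

```python
import re


def first_int(text):
    """Return the first integer found in text (commas ignored), or None."""
    match = re.search(r"\d+", (text or "").replace(",", ""))
    return int(match.group()) if match else None


def split_list(text, sep="/"):
    """Split a delimited string into a list of trimmed, non-empty items."""
    return [part.strip() for part in (text or "").split(sep) if part.strip()]


def clean_details(raw):
    """Normalise raw scraped strings into typed fields (assumed formats)."""
    rating = (raw.get("rating") or "").strip()
    return {
        "name": (raw.get("name") or "").strip(),
        "runtime": first_int(raw.get("runtime")),
        "types": split_list(raw.get("types")),
        # Keep the rating only if it looks like a plain decimal number.
        "rating": float(rating) if re.fullmatch(r"\d+(\.\d+)?", rating) else None,
        "rateNum": first_int(raw.get("rateNum")),
    }
```

Fields that are missing or cannot be parsed come back as None or an empty list, so one malformed page does not abort the whole run.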
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from pymongo import MongoClient
import time
import random
import re
import json
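# Load the connection settings (host, username, password) from setting.json
with open('setting.json') as f:
    setting = json.load(f)
# Connect to the movie database on MongoDB
clint = MongoClient("mongodb://{}:27017/movie".format(setting['host']), username=setting['username'], password=setting['password'])
db = clint["movie"]
# profile holds the target URLs; details and comments receive the scraped data
profile = db["profile"]
details = db["details"]
comments = db["comments"]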
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)
docs = profile.find({"source": "mtime"})
for doc in docs<