[Innovation Training] Crawler Development Log (3): Scraping Mtime Detail Pages

Latest recommended article published on 2024-05-01 14:35:23


Tags: selenium, ajax

Article link: https://blog.csdn.net/subzero_273/article/details/106807360


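This post records how to use selenium to scrape data from Mtime movie detail pages, where part of the content is loaded via ajax. It covers the movie information and comments to collect, how the dynamic pages are handled, and the scraping process.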


Scraping targets

Movie data to scrape: name, nameFrn, year, cover, runtime, types, releaseDate, rating, rateNum, directors, writers, country, summary, stars
Comment data (comments) to scrape: user, userId, rating, content, time
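As a rough illustration of the two target records, the field lists above could be written down as dataclasses. This is only a sketch: the field names come from the lists above, but the types and default values are assumptions, since the post does not state them.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Comment:
    # Field names follow the comment list above; the types are assumed.
    user: str
    userId: str
    rating: Optional[float]
    content: str
    time: str


@dataclass
class MovieDetail:
    # Field names follow the movie list above; types and defaults are assumed.
    name: str
    nameFrn: Optional[str] = None
    year: Optional[int] = None
    cover: Optional[str] = None          # URL of the cover image
    runtime: Optional[int] = None        # assumed to be stored in minutes
    types: List[str] = field(default_factory=list)
    releaseDate: Optional[str] = None
    rating: Optional[float] = None
    rateNum: Optional[int] = None
    directors: List[str] = field(default_factory=list)
    writers: List[str] = field(default_factory=list)
    country: Optional[str] = None
    summary: Optional[str] = None
    stars: List[str] = field(default_factory=list)


def to_document(record):
    """Convert a dataclass instance into a plain dict that can be stored in MongoDB."""
    return asdict(record)
```

Calling to_document(MovieDetail(name="...")) yields a dict with all fourteen keys, so fields that could not be scraped are stored as None or an empty list rather than being silently missing.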

Scraping dynamic pages

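Ratings and similar information on the page are generated dynamically via ajax, so they cannot be scraped directly from the page source.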
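With selenium and chromedriver, a Chrome browser visit can be simulated, which makes it possible to scrape data from dynamically generated pages.
Use Options to configure webdriver to run Chrome in headless mode: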

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)
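
Because the rating is filled in by ajax after the page loads, reading it immediately after opening the page may return an empty value. One way to handle this is to poll until the value appears. The helper below is a minimal sketch written in plain Python; wait_until and its parameters are hypothetical names, not part of selenium, and the post does not show how the author handled this. selenium itself provides WebDriverWait for the same purpose.

```python
import time


def wait_until(read_value, timeout=10.0, interval=0.5):
    """Call read_value() repeatedly until it returns a non-empty result.

    Raises TimeoutError if nothing non-empty is returned within `timeout` seconds.
    Exceptions raised by read_value() are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            value = read_value()
        except Exception:
            value = None
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value was not available within %.1f s" % timeout)
        time.sleep(interval)
```

With a driver created as above, a call could look like wait_until(lambda: driver.find_element(By.XPATH, rating_xpath).text), where By comes from selenium.webdriver.common.by and rating_xpath is a placeholder for whatever expression locates the rating on the page.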

Scraping process

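Connect to the MongoDB database and read the target URLs from the profile collection.
Locate page elements with xpath and read their text, src, and other data.
Finally, after light processing, store the movie data in details and the comment data in comments.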
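
The extraction, cleanup, and storage steps above can be sketched with a few small helpers. Everything here is illustrative rather than the author's implementation: the function names, the "url" key used to match documents, and the assumed runtime format (a number followed by a unit) are all hypothetical. The driver argument is expected to be a selenium webdriver, and details and comments are expected to be pymongo collections.

```python
import re


def text_or_none(driver, xpath):
    """Return the stripped text of the first element matching xpath, or None."""
    try:
        # "xpath" is the value of selenium's By.XPATH constant.
        return driver.find_element("xpath", xpath).text.strip()
    except Exception:
        # selenium raises NoSuchElementException when nothing matches.
        return None


def parse_minutes(runtime_text):
    """Extract the first integer from a runtime string; None if there is none.

    Assumes the page shows the runtime as a number followed by a unit.
    """
    match = re.search(r"\d+", runtime_text or "")
    return int(match.group()) if match else None


def save_movie(details, comments, movie, movie_comments):
    """Upsert one movie document and insert its comments.

    The "url" key is a hypothetical unique identifier for the movie.
    """
    details.update_one({"url": movie["url"]}, {"$set": movie}, upsert=True)
    if movie_comments:  # insert_many does not accept an empty list
        comments.insert_many(movie_comments)
```

Using an upsert for details means re-running the crawler on the same URL updates the existing document instead of creating a duplicate; comments are simply appended in this sketch, so deduplicating them would need extra handling.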

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from pymongo import MongoClient
import time
import random
import re
import json


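# Load connection settings (host, username, password) from a local JSON file
with open('setting.json') as f:
    setting = json.load(f)

# Connect to MongoDB and get handles to the three collections used below
clint = MongoClient(
    "mongodb://{}:27017/movie".format(setting['host']),
    username=setting['username'],
    password=setting['password'],
)
db = clint["movie"]
profile = db["profile"]    # source of the target URLs
details = db["details"]    # scraped movie information
comments = db["comments"]  # scraped movie comments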

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)

# Select the profile documents that came from Mtime
docs = profile.find({"source": "mtime"})
for doc in docs<