Extracting Maoyan Movie Titles, Cast, and Plot Summaries with XPath

Latest recommended article published on 2024-08-08 10:33:11

OTF893


Article tags: html chrome javascript python

Article link: https://blog.csdn.net/weixin_62484294/article/details/121288244


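This post shows how to scrape movie data from the Maoyan website with Python's requests, lxml, and fake_useragent libraries: title, genre, cast, and plot summary. It sets a User-Agent header so the requests are less likely to be flagged as a bot, and parses the HTML pages with XPath. The program also includes a controller function that fetches as many list pages as the user enters.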

Summary generated by CSDN using AI.
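The controller described in the summary walks the list pages one by one. A minimal sketch of how those page URLs could be built is shown here. The step of 30 films per page is taken from the `offset={i*30}` expression in the script; `BASE_URL`, `PAGE_SIZE`, and `list_page_urls` are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.maoyan.com/films'
PAGE_SIZE = 30  # assumption: mirrors the offset={i*30} step used in the script


def list_page_urls(num_pages, show_type=3):
    """Return the list-page URL for each of the first num_pages pages."""
    urls = []
    for page in range(num_pages):
        # Page 0 starts at offset 0, page 1 at offset 30, and so on
        query = urlencode({'showType': show_type, 'offset': page * PAGE_SIZE})
        urls.append(f'{BASE_URL}?{query}')
    return urls


urls = list_page_urls(3)
for u in urls:
    print(u)
```

Building the URLs up front keeps the paging arithmetic separate from the network calls, which makes it easy to check without sending any requests.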

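from lxml import etree
import requests
from time import sleep
import os
from fake_useragent import UserAgent

# Local copy of the user-agent data file, expected in the current working directory
path = os.path.join(os.getcwd(), "fake_useragent_0.1.11.json")
ua = UserAgent(path=path)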
# Send a request and return the page source
def get_html(url):
    header = {'User-Agent': ua.chrome}
    resp = requests.get(url, headers=header)
    sleep(10)
    # On HTTP 200, return the page text; otherwise return None
    if resp.status_code == 200:
        resp.encoding = 'utf-8'
        return resp.text
    # Explicit for readability; a function returns None by default anyway
    return None
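# Parse a list page and return the links to every movie on it
def get_list(html):
    if html is None:  # the request failed, so there is nothing to parse
        return []
    e = etree.HTML(html)  # build the element tree that the XPath queries run against
    # XPath that selects the href of every movie link on the page
    all_a = e.xpath('//div/a[@data-act="movies-click"]/@href')
    return all_a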
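# Extract the data from a single movie detail page
def get_index(html):
    e = etree.HTML(html)  # build the element tree that the XPath queries run against
    # xpath() returns a list, so ''.join() turns it into a single string
    name = ''.join(e.xpath('//h1[@class="name"]/text()'))  # movie title from the h1 tag
    # Genre links sit inside the li tags with class "ellipsis"; strip and join them
    types = ' '.join(t.strip() for t in e.xpath('//li[@class="ellipsis"]/a/text()'))
    # Cast: keep this as a list so format_actors receives whole names,
    # not the single characters of a joined string
    actors = e.xpath('//li[@class="celebrity actor"]/div/a/text()')
    # Plot summary
    plot = ''.join(e.xpath('//div[@class="mod-content"]/span/text()'))
    # Strip whitespace and drop duplicate names
    cast = ', '.join(format_actors(actors))
    return f'Title: {name} Genres: {types} Cast: {cast}  Plot: {plot}'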
# Clean up the cast list
def format_actors(a_list):
    # A set drops repeated names, so each actor appears only once
    actor_set = set()
    for a in a_list:
        # Strip surrounding whitespace; add() ignores names already present
        actor_set.add(a.strip())
    return actor_set
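# Controller: drive the crawl
def start():
    num = int(input('Enter the number of pages to fetch: '))
    for i in range(num):
        url = f'https://www.maoyan.com/films?showType=3&offset={i*30}'
        html = get_html(url)
        if html is None:  # request failed; skip this list page
            continue
        all_href = get_list(html)  # links to every movie on this page
        for a in all_href:
            sleep(10)
            url = f'https://www.maoyan.com{a}'
            index_html = get_html(url)
            if index_html is None:  # request failed; skip this movie
                continue
            info = get_index(index_html)
            print(info)

if __name__ == '__main__':
    start()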
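
The XPath expressions used in the script can be tried offline before any request is sent. The sketch below runs them against a small hand-written HTML string. That markup is an assumption: it only imitates the class names the expressions look for and is not copied from maoyan.com. `SAMPLE_HTML` and `parse_detail` are hypothetical names, and `dict.fromkeys` is swapped in for the script's `set()` so that the cast keeps its original order.

```python
from lxml import etree

# Hypothetical sample markup: it only imitates the class names that the
# script's XPath expressions target; it is not real page source.
SAMPLE_HTML = """
<html><body>
  <h1 class="name">Sample Film</h1>
  <ul>
    <li class="ellipsis"><a> Drama </a><a> Comedy </a></li>
    <li class="ellipsis">120 min</li>
  </ul>
  <ul>
    <li class="celebrity actor"><div><a> Actor A </a></div></li>
    <li class="celebrity actor"><div><a> Actor B </a></div></li>
    <li class="celebrity actor"><div><a> Actor A </a></div></li>
  </ul>
  <div class="mod-content"><span>A short plot summary.</span></div>
</body></html>
"""


def parse_detail(html):
    """Extract title, genres, cast, and plot from a detail-page string."""
    tree = etree.HTML(html)
    name = ''.join(tree.xpath('//h1[@class="name"]/text()')).strip()
    genres = [t.strip() for t in tree.xpath('//li[@class="ellipsis"]/a/text()')]
    raw_actors = tree.xpath('//li[@class="celebrity actor"]/div/a/text()')
    # dict.fromkeys removes duplicates while keeping first-seen order
    actors = list(dict.fromkeys(a.strip() for a in raw_actors if a.strip()))
    plot = ''.join(tree.xpath('//div[@class="mod-content"]/span/text()')).strip()
    return {'name': name, 'genres': genres, 'actors': actors, 'plot': plot}


info = parse_detail(SAMPLE_HTML)
print(info)
```

Returning a dict rather than a formatted string keeps the fields separate, which helps if the results are later written to CSV or JSON.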
